Active Learning by Acquiring Contrastive Examples

Common acquisition functions for active learning use either uncertainty or diversity sampling, aiming to select difficult and diverse data points from the pool of unlabeled data, respectively. In this work, leveraging the best of both worlds, we propose an acquisition function that selects contrastive examples, i.e. data points that are similar in the model feature space but for which the model outputs maximally different predictive likelihoods. We compare our approach, CAL (Contrastive Active Learning), with a diverse set of acquisition functions on four natural language understanding tasks and seven datasets. Our experiments show that CAL performs consistently better than or comparably to the best-performing baseline across all tasks, on both in-domain and out-of-domain data. We also conduct an extensive ablation study of our method and further analyze all actively acquired datasets, showing that CAL achieves a better trade-off between uncertainty and diversity than other strategies.


Introduction
Active learning (AL) is a machine learning paradigm for efficiently acquiring data for annotation from a (typically large) pool of unlabeled data (Lewis and Catlett, 1994; Cohn et al., 1996; Settles, 2009). Its goal is to concentrate the human labeling effort on the most informative data points that will benefit model performance the most, thereby reducing the cost of data annotation.
The most widely used approaches to acquiring data for AL are based on uncertainty and diversity, often described as the "two faces of AL" (Dasgupta, 2011). While uncertainty-based methods leverage the model's predictive confidence to select difficult examples for annotation (Lewis and Gale, 1994; Cohn et al., 1996), diversity sampling exploits heterogeneity in the feature space, typically by performing clustering (Brinker, 2003; Bodó et al., 2011). Still, both approaches have core limitations that may lead to acquiring redundant data points. Algorithms based on uncertainty may end up choosing uncertain yet uninformative repetitive data, while diversity-based methods may tend to select diverse yet easy examples for the model (Roy and McCallum, 2001). The two approaches are orthogonal to each other, since uncertainty sampling is usually based on the model's output, while diversity exploits information from the input (i.e. feature) space. Hybrid data acquisition functions that combine uncertainty and diversity sampling have also been proposed (Shen et al., 2004; Zhu et al., 2008; Ducoffe and Precioso, 2018; Ash et al., 2020; Yuan et al., 2020; Ru et al., 2020).

Figure 1: Illustrative example of our proposed method CAL. The solid line (model decision boundary) separates data points from two different classes (blue and orange); the coloured data points represent the labeled data and the rest are the unlabeled data of the pool.
In this work, we aim to leverage characteristics from hybrid data acquisition. We hypothesize that data points that are close in the model feature space (i.e. share similar or related vocabulary, or similar model encodings) but for which the model produces different predictive likelihoods should be good candidates for data acquisition. We define such examples as contrastive (see example in Figure 1). For that purpose, we propose a new acquisition function that searches for contrastive examples in the pool of unlabeled data. Specifically, our method, Contrastive Active Learning (CAL), selects unlabeled data points from the pool whose predictive likelihoods diverge the most from their neighbors in the training set. This way, CAL shares similarities with diversity sampling, but instead of performing clustering it uses the feature space to create neighborhoods. CAL also leverages uncertainty, by using predictive likelihoods to rank the unlabeled data.
We evaluate our approach on seven datasets from four tasks including sentiment analysis, topic classification, natural language inference and paraphrase detection. We compare CAL against a full suite of baseline acquisition functions that are based on uncertainty, diversity or both. We also examine robustness by evaluating on out-of-domain data, in addition to in-domain held-out sets. Our contributions are the following: 1. We propose CAL, a new acquisition function for active learning that acquires contrastive examples from the pool of unlabeled data (§2); 2. We show that CAL performs consistently better than or comparably to all baselines in all tasks when evaluated in in-domain and out-of-domain settings (§4); 3. We conduct a thorough analysis of our method showing that CAL achieves a better trade-off between diversity and uncertainty compared to the baselines (§6).
We release our code online 1 .

Contrastive Active Learning
In this section we present in detail our proposed method, CAL: Contrastive Active Learning. First, we provide a definition for contrastive examples and how they are related to finding data points that are close to the decision boundary of the model ( §2.1). We next describe an active learning loop using our proposed acquisition function ( §2.2).

Contrastive Examples
In the context of active learning, we aim to formulate an acquisition function that selects contrastive examples from a pool of unlabeled data for annotation. We draw inspiration from the contrastive learning framework, which leverages the similarity between data points to push those from the same class closer together and examples from different classes further apart during training (Mikolov et al., 2013; Sohn, 2016; van den Oord et al., 2019; Chen et al., 2020; Gunel et al., 2021). In this work, we define two data points as contrastive if their model encodings are similar but their model predictions are maximally different (i.e. their predictive likelihoods maximally disagree).
Formally, data points x_i and x_j should first satisfy a similarity criterion:

d(Φ(x_i), Φ(x_j)) < ε,    (1)

where Φ(.) ∈ R^d is an encoder that maps x_i and x_j into a shared feature space, d(.) is a distance metric and ε is a small distance value. A second criterion, based on model uncertainty, requires that the predictive probability distributions p(y|x_i) and p(y|x_j) of the model for the inputs x_i and x_j maximally diverge:

max KL( p(y|x_i) || p(y|x_j) ),    (2)

where KL is the Kullback-Leibler divergence between two probability distributions^2. For example, in a binary classification problem, given a reference example x_1 with output probability distribution (0.8, 0.2)^3 and similar candidate examples x_2 with (0.7, 0.3) and x_3 with (0.6, 0.4), we would consider the pair (x_1, x_3) as more contrastive than (x_1, x_2). However, if another example x_4 (similar to x_1 in the model feature space) had a probability distribution (0.4, 0.6), then the most contrastive pair would be (x_1, x_4). Figure 1 provides an illustration of contrastive examples for a binary classification case. All data points inside the circle (dotted line) are similar in the model feature space, satisfying Eq. 1. Intuitively, if the divergence of the output probabilities of the model for the gray and blue shaded data points is high, then Eq. 2 should also hold and we should consider them contrastive.
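The binary example above can be verified numerically. The sketch below is a minimal illustration (not the paper's code); the probability values are the ones from the example in the text, and it picks the candidate whose predictive distribution diverges most from the reference x_1:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as tuples of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

x1 = (0.8, 0.2)                       # reference example
candidates = {"x2": (0.7, 0.3), "x3": (0.6, 0.4), "x4": (0.4, 0.6)}

# Divergence of each similar candidate's prediction from the reference.
scores = {name: kl(x1, q) for name, q in candidates.items()}

# x4 disagrees with x1 the most, so (x1, x4) is the most contrastive pair.
most_contrastive = max(scores, key=scores.get)
print(most_contrastive)  # x4
```

As expected, the score of x_3 exceeds that of x_2, and x_4 yields the largest divergence of the three.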
From a different perspective, data points with similar model encodings (Eq. 1) and dissimilar model outputs (Eq. 2) should lie close to the model's decision boundary (Figure 1). Hence, we hypothesize that our proposed approach of selecting contrastive examples should acquire data points near the decision boundary of the model.

^2 KL divergence is not a symmetric metric: KL(P||Q) = Σ_x P(x) log(P(x)/Q(x)). We use as input Q the output probability distribution of an unlabeled example from the pool and as target P the output probability distribution of an example from the train set (see §2.2 and Algorithm 1).

^3 A predictive distribution (0.8, 0.2) here denotes that the model is 80% confident that x_1 belongs to the first class and 20% to the second.
Algorithm 1: Single iteration of CAL
Input: labeled data D_lab, unlabeled data D_pool, acquisition size b, model M, number of neighbours k, model representation (encoding) function Φ(.)
1: for each candidate x_p in D_pool do
2:     {x_l^(1), ..., x_l^(k)} ← KNN(Φ(x_p), D_lab, k)
3:     p_l^(i) ← p(y|x_l^(i); M), for i = 1, ..., k
4:     p_p ← p(y|x_p; M)
5:     s_i ← KL(p_l^(i) || p_p), for i = 1, ..., k
6:     s_xp ← (1/k) Σ_i s_i
7: end for
8: Q ← the b examples x_p ∈ D_pool with the highest scores s_xp (argmax)
Output: acquired batch Q

Active Learning Loop
Assuming a multi-class classification problem with C classes, labeled data for training D_lab and a pool of unlabeled data D_pool, we perform AL for T iterations. At each iteration, we train a model on D_lab and then use our proposed acquisition function, CAL (Algorithm 1), to acquire a batch Q consisting of b examples from D_pool. The acquired examples are then labeled^4, removed from the pool D_pool, and added to the labeled dataset D_lab, which serves as the training set for the next AL iteration. In our experiments, we use a pretrained BERT model M (Devlin et al., 2019), which we fine-tune at each AL iteration using the current D_lab. We begin the AL loop by training a model M using an initial labeled dataset D_lab^5.

^4 We simulate AL, so we already have the labels of the examples in D_pool (but still treat it as an unlabeled dataset).

^5 We acquire the first examples that form the initial training set D_lab by applying random stratified sampling (i.e. keeping the initial label distribution).
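The loop above can be sketched as a schematic skeleton. This is not the paper's implementation: `train_model` and `cal_acquire` are hypothetical placeholders (a dummy trainer and a random scorer stand in for BERT fine-tuning and Algorithm 1); only the bookkeeping of moving acquired examples from the pool to the labeled set is the real mechanism:

```python
import random

def train_model(labeled):
    """Placeholder for fine-tuning BERT on the current labeled set."""
    return {"trained_on": len(labeled)}

def cal_acquire(model, labeled, pool, b):
    """Placeholder for Algorithm 1; here we just pick b pool examples at random."""
    return random.sample(sorted(pool), b)

def active_learning_loop(initial_labeled, pool, b, T):
    labeled, pool = set(initial_labeled), set(pool)
    for _ in range(T):
        model = train_model(labeled)                  # retrain at each iteration
        query = cal_acquire(model, labeled, pool, b)  # batch Q of b examples
        # "Label" the acquired examples and move them from the pool to D_lab.
        pool -= set(query)
        labeled |= set(query)
    return labeled, pool

random.seed(0)
labeled, pool = active_learning_loop(range(10), range(10, 110), b=5, T=4)
print(len(labeled), len(pool))  # 30 80
```

After T = 4 iterations of b = 5 acquisitions, 20 examples have migrated from the pool to the labeled set.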
Find Nearest Neighbors for Unlabeled Candidates The first step of our contrastive acquisition function (cf. line 2) is to find examples that are similar in the model feature space (Eq. 1). Specifically, we use the [CLS] token embedding of BERT as our encoder Φ(.) to represent all data points in D_lab and D_pool. We use a K-Nearest-Neighbors (KNN) implementation over the labeled data D_lab, in order to query similar examples x_l ∈ D_lab for each candidate x_p ∈ D_pool. Our distance metric d(.) is the Euclidean distance. To find the most similar data points in D_lab for each x_p, we select the top k instead of using a predefined threshold ε (Eq. 1)^6. This way, we create a neighborhood N_xp = {x_p, x_l^(1), ..., x_l^(k)} for each candidate.
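The neighborhood construction can be sketched with a brute-force nearest-neighbor search. This is an illustrative sketch, not the paper's code: random vectors stand in for the [CLS] embeddings Φ(.):

```python
import numpy as np

def knn_neighborhoods(labeled_emb, pool_emb, k):
    """For each pool example, return the indices of its k nearest labeled
    examples under Euclidean distance (brute-force search)."""
    # Pairwise Euclidean distances, shape (n_pool, n_labeled).
    dists = np.linalg.norm(pool_emb[:, None, :] - labeled_emb[None, :, :], axis=-1)
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.default_rng(0)
labeled_emb = rng.normal(size=(50, 768))   # stand-in for Φ(x_l), x_l in D_lab
pool_emb = rng.normal(size=(200, 768))     # stand-in for Φ(x_p), x_p in D_pool
neighborhoods = knn_neighborhoods(labeled_emb, pool_emb, k=10)
print(neighborhoods.shape)  # (200, 10)
```

In practice a dedicated KNN index (e.g. a tree- or graph-based one) would replace the brute-force distance matrix when |D_lab| is large.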

Compute Contrastive Score between Unlabeled Candidates and Neighbors In the second step, we compute the divergence in the model's predictive probabilities within each neighborhood (Eq. 2). Using the current trained model M to obtain the output probabilities for all data points in N_xp (cf. lines 3-4), we then compute the Kullback-Leibler divergence (KL) between the output probabilities of x_p and all x_l ∈ N_xp (cf. line 5). To obtain a score s_xp for a candidate x_p, we take the average of all divergence scores (cf. line 6).
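Putting the two steps together, the scoring and selection (cf. lines 5-8 of Algorithm 1) can be sketched as follows. This is a sketch of the method, not the released code: the probability arrays are synthetic stand-ins for the model's softmax outputs, and `neighborhoods` maps each pool example to the indices of its k nearest labeled examples:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Row-wise KL(p || q) for batches of categorical distributions."""
    p, q = np.clip(p, eps, 1), np.clip(q, eps, 1)
    return np.sum(p * np.log(p / q), axis=-1)

def cal_scores(pool_probs, labeled_probs, neighborhoods):
    """Average KL divergence between each pool example and its labeled
    neighbors (labeled neighbor as target P, pool example as input Q)."""
    scores = np.empty(len(pool_probs))
    for i, nbrs in enumerate(neighborhoods):
        scores[i] = kl_div(labeled_probs[nbrs], pool_probs[i]).mean()
    return scores

# Toy example: 3 classes, 4 labeled and 5 pool examples, k = 2 neighbors each.
rng = np.random.default_rng(1)
labeled_probs = rng.dirichlet(np.ones(3), size=4)
pool_probs = rng.dirichlet(np.ones(3), size=5)
neighborhoods = [[0, 1], [1, 2], [0, 3], [2, 3], [0, 2]]

scores = cal_scores(pool_probs, labeled_probs, neighborhoods)
batch = np.argsort(-scores)[:2]   # acquire the b = 2 highest-scoring candidates
print(batch)
```

The KL direction follows footnote 2: the labeled neighbor's distribution is the target P and the unlabeled candidate's is the input Q.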

Rank Unlabeled Candidates and Select Batch
We apply these steps to all candidate examples x_p ∈ D_pool and obtain a score s_xp for each. We then rank all candidates by their scores and select the b examples with the highest scores as the acquired batch Q (cf. line 8 of Algorithm 1).

Experimental Setup

Tasks & Datasets
We conduct experiments on sentiment analysis, topic classification, natural language inference and paraphrase detection tasks. We provide details for the datasets in Table 1. We follow Yuan et al. (2020) and use IMDB (Maas et al., 2011), SST-2 (Socher et al., 2013), PUBMED (Dernoncourt and Lee, 2017), and AGNEWS and DBPEDIA from Zhang et al. (2015). We experiment with tasks requiring pairs of input sequences, using QQP and QNLI from GLUE (Wang et al., 2019). To evaluate robustness on out-of-distribution (OOD) data, we follow Hendrycks et al. (2020) and use SST-2 as the OOD dataset for IMDB and vice versa. We finally use TWITTERPPDB (Lan et al., 2017) as OOD data for QQP, as in Desai and Durrett (2020).

Baselines
We compare CAL against five baseline acquisition functions. The first, ENTROPY, is the most commonly used uncertainty-based baseline, which acquires the data points for which the model has the highest predictive entropy. As a diversity-based baseline, following Yuan et al. (2020), we use BERTKM, which applies k-means clustering to the l2-normalized BERT output embeddings of the fine-tuned model to select b data points. We also compare against BADGE (Ash et al., 2020), an acquisition function that aims to combine diversity and uncertainty sampling by computing gradient embeddings g_x for every candidate data point x in D_pool and then using clustering to select a batch.
Each g_x is computed as the gradient of the cross-entropy loss with respect to the parameters of the model's last layer, and is the component that incorporates uncertainty into the acquisition function^7. We also evaluate a recently introduced cold-start acquisition function called ALPS (Yuan et al., 2020) that uses the masked language model (MLM) loss of BERT as a proxy for model uncertainty in the downstream classification task. Specifically, aiming to leverage both uncertainty and diversity, ALPS forms a surprisal embedding s_x for each x by passing the unmasked input x through the BERT MLM head and computing the cross-entropy loss for a random 15% subsample of tokens against the target labels. ALPS clusters these embeddings to sample b sentences at each AL iteration. Lastly, we include RANDOM, which samples data from the pool uniformly at random.
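As a concrete reference point, the ENTROPY baseline described above reduces to a few lines. This is a minimal sketch, where `probs` stands in for the model's softmax outputs over D_pool:

```python
import numpy as np

def entropy_acquire(probs, b, eps=1e-12):
    """Select the b pool examples with the highest predictive entropy."""
    H = -np.sum(probs * np.log(np.clip(probs, eps, 1)), axis=1)
    return np.argsort(-H)[:b]

probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction -> low entropy
    [0.34, 0.33, 0.33],   # near-uniform -> high entropy
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
])
print(entropy_acquire(probs, b=2))  # the two most uncertain examples
```

Unlike CAL, this ranking ignores the feature space entirely, which is why it can repeatedly select uncertain yet redundant points.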

Implementation Details
We use BERT-BASE (Devlin et al., 2019), adding a task-specific classification layer, using the implementation from the HuggingFace library (Wolf et al., 2020). We evaluate the model 5 times per epoch on the development set, following Dodge et al. (2020), and keep the checkpoint with the lowest validation loss. We use the standard splits provided for all datasets, if available; otherwise we randomly sample a validation set from the training set. We test all models on a held-out test set. We repeat all experiments with five different random seeds, resulting in different initializations of the parameters of the model's extra task-specific output feedforward layer and of the initial D_lab. For all datasets we use a budget of 15% of D_pool, an initial training set of 1%, and acquisition size b = 2%. Each experiment is run on a single Nvidia Tesla V100 GPU. More details are provided in Appendix A.1.

In-domain Performance
We present results for in-domain test accuracy across all datasets and acquisition functions in Figure 2. We observe that CAL is consistently the top-performing method, particularly on the DBPEDIA, PUBMED and AGNEWS datasets. CAL performs slightly better than ENTROPY on IMDB, QNLI and QQP, while on SST-2 most methods yield similar results. ENTROPY is the second-best acquisition function overall, consistently performing better than the diversity-based and hybrid baselines. This corroborates recent findings from Desai and Durrett (2020) that BERT is sufficiently calibrated (i.e. produces good uncertainty estimates), making it a tough baseline to beat in AL.
BERTKM is a competitive baseline (e.g. on SST-2 and QNLI) but always underperforms compared to CAL and ENTROPY, suggesting that uncertainty is the most important signal in the data selection process. An interesting future direction would be to investigate in depth which representations (i.e. which layers) of current pretrained language models work best with similarity search algorithms and clustering.
Similarly, we can see that BADGE, despite using both uncertainty and diversity, also achieves low performance, indicating that clustering the constructed gradient embeddings does not benefit data acquisition. Finally, we observe that ALPS generally underperforms and is close to RANDOM. We can conclude that this heterogeneous approach to uncertainty, i.e. using the pretrained language model as proxy for the downstream task, is beneficial only in the first few iterations, as shown in Yuan et al. (2020).
Surprisingly, we observe that for the SST-2 dataset ALPS performs similarly to the highest-performing acquisition functions, CAL and ENTROPY. We hypothesize that due to the informal textual style of the reviews in SST-2 (noisy social media data), the pretrained BERT model can be used as a signal to query linguistically hard examples that benefit the downstream sentiment analysis task. This is an interesting finding, and a future research direction would be to investigate the correlation between the difficulty of an example in a downstream task and its perplexity (loss) under the pretrained language model.

Out-of-domain Performance
We also evaluate the out-of-domain (OOD) robustness of the models trained with the actively acquired datasets of the last iteration (i.e. 15% of D pool or 100% of the AL budget) using different acquisition strategies. We present the OOD results for SST-2, IMDB and QQP in Table 2. When we test the models trained with SST-2 on IMDB (first column) we observe that CAL achieves the highest performance compared to the other methods by a large margin, indicating that acquiring contrastive examples can improve OOD generalization. In the opposite scenario (second column), we find that the highest accuracy is obtained with ENTROPY. However, similarly to the ID results for SST-2 (Figure 2), all models trained on different subsets of the IMDB dataset result in comparable performance when tested on the small SST-2 test set (the mean accuracies lie inside the standard deviations across models). We hypothesize that this is because SST-2 is not a challenging OOD dataset for the different IMDB models. This is also evident by the high OOD accuracy, 85% on average, which is close to the 91% SST-2 ID accuracy of the full model (i.e. trained on 100% of the ID data). Finally, we observe that CAL obtains the highest OOD accuracy for QQP compared to RANDOM, ENTROPY and ALPS. Overall, our empirical results show that the models trained on the actively acquired dataset with CAL obtain consistently similar or better performance than all other approaches when tested on OOD data.

Ablation Study
We conduct an extensive ablation study in order to provide insights into the behavior of every component of CAL. We present all AL experiments on the AGNEWS dataset in Figure 3.

Least contrastive examples We first invert our acquisition function to select the least contrastive examples, replacing argmax(.) with argmin(.) in line 8 of Algorithm 1. We observe (Fig. 3 - CAL opposite) that even after acquiring 15% of the unlabeled data, performance remains unchanged compared to the initial model (of the first iteration), or even degrades. In effect, this finding indicates that CAL does select informative data points.
Neighborhood Next, we experiment with changing the way we construct the neighborhoods, aiming to improve computational efficiency. We thus modify our algorithm to create a neighborhood for each labeled example (instead of each unlabeled one)^8. This way we compute a divergence score only for the neighbors of the training data points. However, we find this approach to slightly underperform (Fig. 3 - CAL per labeled example), possibly because only a small fraction of the pool is considered and thus the uncertainty of all the unlabeled data points is not taken into account.

^8 In this experiment, we essentially change the for-loop of Algorithm 1 (cf. lines 1-7) to iterate over each x_l in D_lab (instead of each x_p in D_pool) and, similarly, find the k nearest neighbors of each labeled example in the pool (KNN(x_l, D_pool, k)). As for the scoring (cf. line 6), if an unlabeled example was not picked (i.e. was not a neighbor of any labeled example), its score is zero; if it was picked multiple times, we average its scores. We finally acquire the top b unlabeled data points with the highest scores. This formulation is more computationally efficient since usually |D_lab| << |D_pool|.

Scoring function
We also experiment with several approaches to constructing our scoring function (cf. line 6 in Algorithm 1). Instead of computing the KL divergence between the predicted probabilities of each candidate example and those of its labeled neighbors (cf. line 5), we use the cross-entropy between the candidate's output probability distribution and the gold labels of the labeled neighbors. The intuition is to evaluate whether information from the actual label is more useful than the model's predictive probability distribution. We observe that this scoring function results in a slight drop in performance (Fig. 3 - Cross Entropy). We also experimented with various pooling operations to aggregate the KL divergence scores for each candidate data point. We found maximum and median (Fig. 3 - Max/Median) to perform similarly to the average (Fig. 3 - CAL), which is the pooling operation we keep in our proposed algorithm.
Feature Space Since our approach aims to acquire data near the model's decision boundary, our default choice of representation is the [CLS] output embedding of the fine-tuned BERT model. Still, we opted to cover several possible alternatives for the representations, i.e. the feature space, used to find the neighbors with KNN. We divide our exploration into two categories: intrinsic representations from the current fine-tuned model and extrinsic representations from different methods. For the first category, we examine representing each example with the mean embedding-layer representation of BERT (Fig. 3 - Mean embedding) or the mean output embedding (Fig. 3 - Mean output). We find both alternatives to perform worse than using the [CLS] token (Fig. 3 - CAL). The motivation for the second category is to evaluate whether acquiring contrastive examples in the input feature space, i.e. representing the raw text, is meaningful (Gardner et al., 2020)^9. We thus examine contextual representations from a pretrained BERT language model (Fig. 3 - BERT-pr [CLS]) (not fine-tuned on the task or domain) and non-contextualized TF-IDF vectors (Fig. 3 - TF-IDF). We find both approaches, along with Mean embedding, to largely underperform compared to our approach that acquires ambiguous data near the model decision boundary.

^9 This can be interpreted as comparing the effectiveness of selecting data near the model decision boundary vs. the task decision boundary, i.e. data that are similar for the task itself or for humans (in terms of having the same raw input/vocabulary), but are from different classes.

Analysis
Finally, we further investigate CAL and all the acquisition functions considered (baselines) in terms of diversity, representativeness and uncertainty. Our aim is to provide insights into what data each method tends to select and the uncertainty-diversity trade-off of each approach. Table 3 shows the results of our analysis averaged across datasets. We denote by L the labeled set, by U the unlabeled pool, and by Q an acquired batch of data points from U^10.

Diversity & Uncertainty Metrics
Diversity in input space (DIV.-I) We first evaluate the diversity of the actively acquired data in the input feature space, i.e. raw text, by measuring the overlap between tokens in the sampled sentences Q and tokens from the rest of the data pool U. Following Yuan et al. (2020), we compute DIV.-I as the Jaccard similarity between the set of tokens from the sampled sentences Q, V_Q, and the set of tokens from the unsampled sentences U\Q, V_{U\Q}: DIV.-I = |V_Q ∩ V_{U\Q}| / |V_Q ∪ V_{U\Q}|.

Diversity in feature space (DIV.-F) We also measure diversity in the model feature space, where Φ(x_i) denotes the [CLS] output embedding of example x_i obtained by the model trained on L, and d(Φ(x_i), Φ(x_j)) denotes the Euclidean distance between x_i and x_j in the feature space.
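A minimal sketch of the DIV.-I computation; the token sets here are toy stand-ins for the vocabularies of the sampled and unsampled sentences:

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two token sets."""
    return len(a & b) / len(a | b)

tokens_Q = {"the", "movie", "was", "great"}       # tokens in sampled batch Q
tokens_rest = {"the", "plot", "was", "boring"}    # tokens in U \ Q
div_i = jaccard(tokens_Q, tokens_rest)
print(div_i)  # |{the, was}| / |union of 6 tokens| = 1/3
```

A higher DIV.-I means the acquired batch shares more vocabulary with the rest of the pool, i.e. it is less lexically idiosyncratic.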

Uncertainty (UNC.)
To measure uncertainty, we use the model M_f trained on the entire training dataset (Figure 2 - Full supervision). As in Yuan et al. (2020), we use the logits from the fully trained model to estimate the uncertainty of an example, since this is a reliable estimate due to the model's high performance after training on many examples. We compute the predictive entropy UNC(x) = -Σ_c p(y = c|x) log p(y = c|x). As a sampled batch Q we use the full actively acquired dataset after completing our AL iterations (i.e. 15% of the data).

^10 In the previous sections we used D_lab and D_pool to denote the labeled and unlabeled sets; we change the notation here to L and U, respectively, for simplicity.

^11 To enable an appropriate comparison, this analysis is performed after the initial BERT model is trained with the initial training set and each AL strategy has selected examples equal to 2% of the pool (first iteration). Correspondingly, all strategies select examples from the same unlabeled set U while using outputs from the same BERT model.

Representativeness (REPR.)
We finally analyze the representativeness of the acquired data as in Ein-Dor et al. (2020). We aim to study whether AL strategies tend to select outlier examples that do not properly represent the overall data distribution. We rely on the KNN-density measure proposed by Zhu et al. (2008), where the density of an example is quantified as one over the average distance between the example and its K most similar examples (i.e., K nearest neighbors) within U, based on the [CLS] representations as in DIV.-F. An example with a high density degree is less likely to be an outlier. We define the representativeness of a batch Q as the average KNN-density of its instances, using the Euclidean distance with K = 10.
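The KNN-density measure can be sketched as follows; random vectors stand in for the [CLS] representations, and K is kept small for the toy example:

```python
import numpy as np

def knn_density(X, K):
    """Density of each row of X: one over the average Euclidean distance
    to its K nearest neighbors (excluding itself)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :K]      # K smallest distances per row
    return 1.0 / knn.mean(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, size=(20, 8)),   # a dense cluster
               rng.normal(5, 0.1, size=(1, 8))])   # a single far-away outlier
density = knn_density(X, K=5)
# The outlier (last row) is far from everything, so its density is lowest.
print(density.argmin())  # 20
```

The representativeness of a batch would then be the mean of `density` over the batch's rows: batches dominated by outliers score low.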

Discussion
We first observe in Table 3 that ALPS acquires the most diverse data across all approaches. This is intuitive, since ALPS is the most linguistically-informed method: it essentially acquires data that are difficult for the language modeling task, thus favoring data with a more diverse vocabulary. All other methods acquire similarly diverse data, except BADGE, which has the lowest score. Interestingly, we observe a different pattern when evaluating diversity in the model feature space (DIV.-F).

Overall, our analysis validates assumptions on the properties of data expected to be selected by the various acquisition functions. Our findings show that diversity in the raw text does not necessarily correlate with diversity in the feature space. In other words, low DIV.-F does not translate to low diversity in the distribution of acquired tokens (DIV.-I), suggesting that CAL can acquire similar examples in the feature space that have sufficiently diverse inputs. Furthermore, combining the results of our AL experiments (Figure 2) and our analysis (Table 3), we conclude that the best performance of CAL, followed by ENTROPY, is due to acquiring uncertain data. We observe that the most notable difference, in terms of selected data, between these two approaches and the rest is uncertainty (UNC.), perhaps suggesting the superiority of uncertainty over diversity sampling. We show that CAL improves over ENTROPY because our algorithm "guides" uncertainty sampling by not considering redundant uncertain data that lie away from the decision boundary, thus improving representativeness. We finally find that RANDOM is evidently the worst approach, as it selects the least diverse and uncertain data on average compared to all methods.

Related Work
Uncertainty Sampling Uncertainty-based acquisition for AL focuses on selecting data points that the model predicts with low confidence. A simple uncertainty-based acquisition function is least confidence (Lewis and Gale, 1994) that sorts data in descending order from the pool by the probability of not predicting the most confident class. Another approach is to select samples that maximize the predictive entropy. Houlsby et al. (2011) propose Bayesian Active Learning by Disagreement (BALD), a method that chooses data points that maximize the mutual information between predictions and model's posterior probabilities. Gal et al. (2017) applied BALD for deep neural models using Monte Carlo dropout (Gal and Ghahramani, 2016) to acquire multiple uncertainty estimates for each candidate example. Least confidence, entropy and BALD acquisition functions have been applied in a variety of text classification and sequence labeling tasks, showing to substantially improve data efficiency (Shen et al., 2017;Siddhant and Lipton, 2018;Lowell and Lipton, 2019;Kirsch et al., 2019;Shelmanov et al., 2021;Margatina et al., 2021).
Diversity Sampling On the other hand, diversity or representative sampling is based on selecting batches of unlabeled examples that are representative of the unlabeled pool, following the intuition that a representative set of examples, once labeled, can act as a surrogate for the full data available. In the context of deep learning, Geifman and El-Yaniv (2017) and Sener and Savarese (2018) select representative examples based on core-set construction, a fundamental problem in computational geometry. Inspired by generative adversarial learning, Gissin and Shalev-Shwartz (2019) define AL as a binary classification task with an adversarial classifier trained so that it cannot discriminate data from the training set and the pool. Other approaches based on adversarial active learning use out-of-the-box models to perform adversarial attacks on the training data, in order to approximate the distance from the decision boundary of the model (Ducoffe and Precioso, 2018; Ru et al., 2020).
Hybrid There are several existing approaches that combine representative and uncertainty sampling. These include active learning algorithms that use meta-learning (Baram et al., 2004; Hsu and Lin, 2015) and reinforcement learning (Fang et al., 2017; Liu et al., 2018), aiming to learn a policy for switching between a diversity-based and an uncertainty-based criterion at each iteration. Recently, Ash et al. (2020) proposed Batch Active learning by Diverse Gradient Embeddings (BADGE) and Yuan et al. (2020) proposed Active Learning by Processing Surprisal (ALPS), a cold-start acquisition function specific to pretrained language models. Both methods construct representations for the unlabeled data based on uncertainty, and then use them for clustering, hence combining uncertainty and diversity sampling. The effectiveness of AL in a variety of NLP tasks with pretrained language models, e.g. BERT (Devlin et al., 2019), has recently been evaluated empirically by Ein-Dor et al. (2020), showing substantial improvements over random sampling.

Conclusion & Future Work
We present CAL, a novel acquisition function for AL that acquires contrastive examples; data points which are similar in the model feature space and yet the model outputs maximally different class probabilities. Our approach uses information from the feature space to create neighborhoods for each unlabeled example, and predictive likelihood for ranking the candidate examples. Empirical experiments on various in-domain and out-of-domain scenarios demonstrate that CAL performs better than other acquisition functions in the majority of cases. After analyzing the actively acquired datasets obtained with all methods considered, we conclude that entropy is the hardest baseline to beat, but our approach improves it by guiding uncertainty sampling in regions near the decision boundary with more informative data.
Still, our empirical results and analysis show that there is no single acquisition function to outperform all others consistently by a large margin. This demonstrates that there is still room for improvement in the AL field.
Furthermore, recent findings show that in specific tasks, such as Visual Question Answering (VQA), complex acquisition functions might not outperform random sampling because they tend to select collective outliers that hurt model performance (Karamcheti et al., 2021). We believe that taking a step back and analyzing the behavior of standard acquisition functions, e.g. with Dataset Maps (Swayamdipta et al., 2020), is a promising direction for future work.

A.1 Implementation Details

We implement all models using the HuggingFace library (Wolf et al., 2020) in Pytorch (Paszke et al., 2019). We train all models with batch size 16, learning rate 2e-5, no weight decay, and the AdamW optimizer with epsilon 1e-8. For all datasets we use a maximum sequence length of 128, except for IMDB, which contains longer input texts, where we use 256. To ensure reproducibility and fair comparison between the various methods under evaluation, we run all experiments with the same five seeds, randomly selected from the range [1, 9999]. We evaluate the model 5 times per epoch on the development set following Dodge et al. (2020) and keep the checkpoint with the lowest validation loss. We use the code provided by Yuan et al. (2020) for ALPS, BADGE and BERTKM.

A.2 Efficiency
In this section we compare the efficiency of the acquisition functions considered in our experiments. We denote by m the number of labeled data in D_lab, n the number of unlabeled data in D_pool, C the number of classes in the downstream classification task, d the dimension of the embeddings, t a fixed number of iterations for k-MEANS, l the maximum sequence length and k the acquisition size. In our experiments, following Yuan et al. (2020), k = 100, d = 768, t = 10, and l = 128^12. ALPS requires O(tknl), considering that the surprisal embeddings are computed. BERTKM and BADGE, the most computationally heavy approaches, require O(knd) and O(Cknd) respectively, given that gradient embeddings are computed for BADGE^13. On the other hand, ENTROPY only requires n forward passes through the model, in order to obtain the logits for all the data in D_pool. Instead, our approach, CAL, first requires m + n forward passes, in order to acquire the logits and the [CLS] representations of the data (in D_pool and D_lab), and then one iteration over all data in D_pool to obtain the scores.

^12 Except for IMDB where l = 256.
We present the runtimes in detail for all datasets and acquisition functions in Tables 4 and 5. First, we define the total acquisition time as the sum of two components: inference time and selection time. Inference time is the time required to pass all data through the model in order to acquire predictions, probability distributions or model encodings (representations). This is explicitly required for the uncertainty-based methods, like ENTROPY, and for our method CAL. The remaining time is considered selection time, and is essentially the time for all computations necessary to rank and select the b most important examples from D_pool.
We observe in Table 4 that the diversity-based functions do not require this explicit inference time, while for ENTROPY it is the only computation that is needed (taking the argmax of a list of uncertainty scores is negligible). CAL requires both inference and selection time. We can see that the inference time of CAL is somewhat higher than that of ENTROPY because we do m + n forward passes instead of n, i.e. over both D_pool and D_lab instead of only D_pool. The selection time for CAL corresponds to the for-loop presented in Algorithm 1. We observe that it is often less computationally expensive than the inference step (which is a simple forward pass through the model). Still, there is room for improvement in reducing the time complexity of this step.
In Table 5 we present the total time for all datasets (ordered with increasing D pool size) and the average time for each acquisition function, as a means to rank their efficiency. Because we do not apply all acquisition functions to all datasets we compute three different average scores in order to ensure fair comparison. AVG.-ALL is the average time across all 7 datasets and is used to compare RANDOM, ALPS, ENTROPY and CAL. AVG.-3 is the average time across the first 3 datasets (IMDB,  and is used to compare all