ALLSH: Active Learning Guided by Local Sensitivity and Hardness

Active learning, which effectively collects informative unlabeled data for annotation, reduces the demand for labeled data. In this work, we propose to retrieve unlabeled samples with a local-sensitivity- and hardness-aware acquisition function. The proposed method generates data copies through local perturbations and selects data points whose predictive likelihoods diverge the most from those of their copies. We further empower the acquisition function by selecting the worst-case perturbation. Our method achieves consistent gains over commonly used active learning strategies on various classification tasks. Furthermore, we observe consistent improvements over the baselines in a study of prompt selection for prompt-based few-shot learning. These experiments demonstrate that our acquisition, guided by local sensitivity and hardness, is effective and beneficial for many NLP tasks.


Introduction
Crowdsourcing annotations (Rajpurkar et al., 2016; Bowman et al., 2015) has become a common practice for developing NLP benchmark datasets, but manual labeling remains time-consuming and expensive. Rich prior works (Pavlick and Kwiatkowski, 2019; Nie et al., 2020; Ferracane et al., 2021) show that disagreements in crowdsourced annotations are often not annotation artifacts but rather core linguistic phenomena. Active Learning (AL) is introduced to efficiently acquire data for annotation from a (typically large) pool of unlabeled data. Its goal is to concentrate the human labeling effort on the most informative data, in hopes of maximizing model performance while minimizing the data annotation cost.
Popular approaches to acquiring data for AL are uncertainty sampling and diversity sampling. Uncertainty sampling selects data that the model predicts with low confidence (Lewis and Gale, 1994; Culotta and McCallum, 2005; Settles, 2009). Diversity sampling selects batches of unlabeled examples that are prototypical of the unlabeled pool, to exploit heterogeneity in the feature space (Xu et al., 2003; Bodó et al., 2011). Different from these two perspectives, recent works focus on the informativeness of the selected data. For example, Zhang and Plank (2021) acquire informative unlabeled data using training dynamics based on the model's predictive log likelihood, and Margatina et al. (2021) construct contrastive examples in the input feature space. However, these methods either ignore the local sensitivity of the input features or take no account of the difficulty of the learning data. Consequently, they may miss examples around the decision boundary, or select hard-to-train or even noisy examples. Their performance may further suffer in practical settings, such as those with imbalanced labels or a very limited annotation budget. (Code is available at https://github.com/szhang42/allsh.)
In this work, we determine informativeness by considering both local sensitivity and learning difficulty. For local sensitivity, we take the classical definition from Chapelle et al. (2009), which is widely used in both classic machine learning problems (e.g., Blum and Chawla, 2001; Chapelle et al., 2002; Seeger, 2000; Zhu et al., 2003; Zhou et al., 2004) and recent deep learning settings (e.g., Wang et al., 2018b; Sohn et al., 2020; Xu et al., 2021). Specifying a local region R_region(x) around an example x, we assume as a prior that all examples in R_region(x) have the same label. If the examples in R_region(x) give us different labels, we say the local region of x is sensitive. Data augmentation has been chosen as the way to create label-equivalent local regions in many recent works (e.g., Berthelot et al., 2019b; Xie et al., 2020). We utilize data augmentation as a tool to capture the local sensitivity and hardness of inputs and present ALLSH: Active Learning guided by Local Sensitivity and Hardness. Through various designs of local perturbations, ALLSH selects unlabeled data points from the pool whose predictive likelihoods diverge the most from those of their augmented copies. This way, ALLSH can effectively ensure that informative and locally sensitive data receive correct human-annotated labels. Figure 1 illustrates the scheme of the proposed acquisition strategy.
We conduct a comprehensive evaluation of our approach on datasets ranging from sentiment analysis, topic classification, natural language inference, to paraphrase detection. To measure the proposed acquisition function in more realistic settings where the samples stem from a dissimilar input distribution, we (1) set up an out-of-domain test dataset and (2) leak out-of-domain data (e.g., adversarial perturbations) into the selection pool.
We further expand the proposed acquisition to a more challenging setting: prompt-based few-shot learning (Zhao et al., 2021), where we query a fixed pre-trained language model via a natural language prompt containing a few training examples. We focus on selecting the most valuable prompts for a given test task (e.g., selecting 4 prompts for one given dataset). We adapt our acquisition function to retrieve prompts for the GPT-2 model. Furthermore, we provide extensive ablation studies on different design choices for the acquisition function, including the designs of augmentations and divergences. Our method shows consistent gains in all settings with multiple datasets. With little modification, our data acquisition can be easily applied to other NLP tasks for a better sample selection strategy.
Our contributions are summarized as follows:
(1) We present a new acquisition strategy that embraces local sensitivity and learning difficulty, e.g., by paraphrasing the inputs through data augmentation and adversarial perturbations, in the selection procedure.
(2) We verify the effectiveness and general applicability of the proposed method in more practical settings with imbalanced datasets and extremely few labeled data.
(3) We provide a comprehensive study and experiments of the proposed selection criteria in classification tasks (both in-domain and out-of-domain evaluations) and prompt-based few-shot learning.
(4) The proposed data sampling strategy can be easily incorporated into or extended to many other NLP tasks.

Method
In this section we present in detail our proposed method, ALLSH (Algorithm 1).

Active Learning Loop
The active learning setup consists of an unlabeled dataset D_pool, the current training set D_label, and a model M whose output probability is p_θ(· | x) for input x. The model M is generally a pre-trained model for NLP tasks (Lowell et al., 2018). At each iteration, we train the model on D_label and then use the acquisition function to acquire a batch T of s_acq sentences from D_pool. The acquired examples from this iteration are labeled, added to D_label, and removed from D_pool. The updated D_label then serves as the training set in the next AL iteration, until the budget is exhausted. Overall, the system is given a budget of S queries to build a labeled training dataset of size S.
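To make the loop concrete, here is a minimal Python sketch of the procedure above. The helpers `train`, `annotate`, and `score_fn` are illustrative placeholders (not the released implementation); any acquisition function that maps an example to a score can be plugged in.

```python
def active_learning_loop(model, d_label, d_pool, s_acq, budget,
                         score_fn, train, annotate):
    """Train on D_label, score D_pool, and move the top-s_acq batch T
    into D_label until the budget of S labeled examples is spent."""
    while len(d_label) < budget and d_pool:
        train(model, d_label)                          # fit M on current D_label
        scores = [score_fn(model, x) for x in d_pool]  # e.g., divergence in Eqn (1)
        ranked = sorted(range(len(d_pool)), key=scores.__getitem__, reverse=True)
        picked = set(ranked[:s_acq])
        batch_t = [d_pool[i] for i in picked]          # acquired batch T
        d_label.extend(annotate(batch_t))              # human annotation step
        d_pool = [x for i, x in enumerate(d_pool) if i not in picked]
    return model, d_label
```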

Acquisition Function Design
To fully capture data informativeness and train a model with a limited amount of data, we consider two data-selection principles: local sensitivity and learning hardness.

Local Sensitivity Based on theoretical works on the margin theory for active learning, the examples lying close to the decision boundary are informative and worth labeling (Ducoffe and Precioso, 2018; Margatina et al., 2021). Uncertainty sampling suffers from a sampling bias problem, as the model is only trained with a few examples in the early phase of training. In addition, high-uncertainty samples given the current model state may not be representative of the whole unlabeled pool (Ru et al., 2020). For example, if an input has a high-confidence prediction while its local perturbation yields a low-confidence output, then this input likely lies close to the model's decision boundary. This information can be captured by measuring the difference between an input example and its augmentation in the output feature space. We utilize back-translation (Sennrich et al., 2016; Edunov et al., 2018; Zhang et al., 2021b) and TF-IDF word replacement (Xie et al., 2020) as effective augmentation methods that can generate diverse paraphrases while preserving the semantics of the original inputs (Yu et al., 2018b).
Instead of simply using augmentation, adversarial perturbation can measure local Lipschitzness and sensitivity more effectively. We therefore further exploit adversarial perturbation to more accurately measure local sensitivity. For NLP problems, generating exact adversarial perturbations in a discrete space usually requires combinatorial optimization, which often suffers from the curse of dimensionality (Madry et al., 2017; Lei et al., 2018). Hence, we choose the hardest augmentation over K random augmentations as a "lightweight" variant of adversarial input augmentation, which optimizes the worst-case loss over the augmented data.

Learning Hardness: From Easy to Hard Learning from easy examples or propagating labels from high-confidence examples is the key principle of curriculum learning (Bengio et al., 2009) and label-propagation-based semi-supervised learning algorithms (Chapelle et al., 2009). For example, FixMatch (Sohn et al., 2020), a state-of-the-art semi-supervised method, applies an indicator function to select high-confidence examples at each iteration. This facilitates propagating label information from high-confidence examples to low-confidence ones (Chapelle et al., 2009). In our selection criterion, as the model is trained with limited data, we also want to avoid hard-to-learn examples, which in some cases frequently correspond to mislabeled or erroneous instances (Swayamdipta et al., 2020; Zhang and Plank, 2021). Such examples may stall model performance at the beginning of the selection process.

Acquisition with Local Sensitivity and Hardness
We come to the definition of our acquisition function. Given a model p_θ and an input x, we compute the output distribution p_θ(· | x) and a noised version p_θ(· | x′) by applying a random transformation x′ = g(x) to the input. Here, g(·) is sampled from a family of transformations, and these random transformations stand for data augmentations. Labeling and training on the examples found this way encourages the model to be insensitive to the transformation g(·) and hence smoother with respect to changes in the input space (Berthelot et al., 2019b,a; Sohn et al., 2020). We calculate

ℓ(x) = D( p_θ(· | x) ∥ p_θ(· | x′) ),   (1)

where D denotes a statistical distance such as the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951). The model p_θ here can be a pretrained language model such as BERT (Devlin et al., 2018).
Data Paraphrasing via Augmentation Paraphrase generation can improve language models (Yu et al., 2018a) by handling language variation. TF-IDF word replacement and back-translation can generate diverse inputs while preserving the semantic meaning (Singh et al., 2019; Xie et al., 2020). For TF-IDF, we replace uninformative words that have low TF-IDF scores while keeping those with high scores. Specifically, suppose IDF(w) is the IDF score for word w computed over the whole corpus, and TF(w) is the TF score for word w in a sentence; we compute the TF-IDF score as TFIDF(w) = TF(w) · IDF(w). For back-translation, we use pre-trained EN→DE and DE→EN translation models (Ng et al., 2019) to back-translate each sentence. We denote x as (x_0, …, x_n), where n is the original length of the input. We pass x through the two translation models to get x′ = (x′_0, …, x′_m), where m denotes the length after back-translation. More details can be found in Appendix A.
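As an illustration, here is a minimal sketch of the TF-IDF replacement described above. The `replace_frac` knob, the IDF smoothing, and the random replacement vocabulary are our assumptions for this sketch, not the paper's exact recipe; back-translation can be sketched analogously by chaining pre-trained EN→DE and DE→EN translation models.

```python
import math
import random
from collections import Counter

def tfidf_replace(sentence, corpus, vocab, replace_frac=0.1, seed=None):
    """Replace the words with the lowest TF-IDF scores by random vocabulary
    words, keeping the informative (high-score) words untouched."""
    rng = random.Random(seed)
    tokens = sentence.split()
    doc_freq = Counter(w for doc in corpus for w in set(doc.split()))
    tf = Counter(tokens)
    # TFIDF(w) = TF(w) * IDF(w), with IDF computed over the whole corpus
    tfidf = {w: tf[w] * math.log(len(corpus) / (1 + doc_freq[w])) for w in tf}
    k = max(1, int(replace_frac * len(tokens)))
    low_score = set(sorted(tfidf, key=tfidf.get)[:k])
    return " ".join(rng.choice(vocab) if w in low_score else w for w in tokens)
```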
Select Worst-Case Augmentation (WCA) To measure local sensitivity effectively, the most direct approach is calculating the local Lipschitz constant or finding the worst-case adversarial perturbation. However, estimating the Lipschitz constant of a neural network is either model-dependent or computationally hard (Scaman and Virmaux, 2018; Fazlyab et al., 2019). Instead, we select the worst-case augmentation over K copies, which can still roughly measure the norm of the first-order gradient without a large computational cost and is easy to implement. Given an input example x and K augmentations {x_i}_{i=1}^K of x, we propose the following acquisition function to select data:

ℓ_max(x) = max_{i ∈ {1, …, K}} D( p_θ(· | x) ∥ p_θ(· | x_i) ).   (2)

Inspired by simple and informal analysis in continuous space, we draw the connection between calculating ℓ_max(x) and local sensitivity.

[Figure: orange circles refer to the unlabeled data and green circles refer to the corresponding augmentations of the unlabeled data.]
Recent works in computer vision (Gong et al., 2020;Wang et al., 2021) have provided more formal connections between local gradient norm estimation and K-worst perturbations.
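A minimal sketch of this worst-of-K score (Eqn (2)) follows, under the assumptions that `model(x)` returns classification logits for a single example and `augment(x)` returns one perturbed copy; both are placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def worst_case_divergence(model, x, augment, k=4):
    """Eqn (2) sketch: the largest KL divergence between the prediction on x
    and the predictions on K randomly augmented copies of x."""
    p = F.softmax(model(x), dim=-1)
    best = torch.tensor(0.0)
    for _ in range(k):
        q = F.softmax(model(augment(x)), dim=-1)
        kl = torch.sum(p * (torch.log(p) - torch.log(q)), dim=-1)
        best = torch.maximum(best, kl)
    return best  # high value => x is locally sensitive, i.e., near the boundary
```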
The text sentences in NLP live in a discrete space, which lacks a definition of local Lipschitzness, but finding the worst perturbation in a local discrete set can still be a better measurement of local sensitivity in the semantic space.

Choice of Divergence We use the KL divergence as the primary measure of the statistical distance between the output distribution of an original example and that of its augmented example. We also empirically provide a detailed analysis of the Jensen-Shannon Distance (JSD) (Endres and Schindelin, 2003) and the α-divergence (Minka et al., 2005) as complementary measures in Section 5. The α-divergence (Pillutla et al., 2021) is a general divergence family, which includes the popular KL divergence and reverse KL divergence as special cases; different values of α trade off between overestimation and underestimation. JSD is a metric that is symmetric and bounded within the range [0, 1]. These divergences are calculated as

KL(p ∥ q) = Σ_c p(c) log( p(c) / q(c) ),
JSD(p, q) = ( ½ KL(p ∥ m) + ½ KL(q ∥ m) )^{1/2},
D_α(p ∥ q) = (1 / (α(1 − α))) ( 1 − Σ_c p(c)^α q(c)^{1−α} ),

where p is the output probability distribution of an example, q is the output probability distribution of its augmented example, and m = ½(p + q).

Local Sensitivity and Informativeness The divergence objective exploits unlabeled data by measuring predictions across slightly distorted versions of each unlabeled sample. The diverse and adversarial augmentations capture the local sensitivity and informativeness of inputs and project examples toward the decision boundary (Ducoffe and Precioso, 2018). Thus, examples whose copies yield highly inconsistent model predictions lie close to the decision boundary of the model (Gao et al., 2020). These examples are valuable to have human annotations because they (1) contain a high-confidence region within a local perturbation and are therefore easy to train on, and (2) are highly likely to promote the model with large-margin improvements (see the example in Figure 2). Under our local sensitivity and hardness guided acquisition, we argue that the selected examples are not necessarily the examples with the highest uncertainty, which do not always benefit training. For instance, an example may have low-confidence predictions on both its original and augmented inputs, making it one of the hardest samples to train on.
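A sketch of the three divergence choices above over predictive distributions, assuming `p` and `q` are probability vectors over classes; the eps smoothing is our addition for numerical stability, and the α-divergence uses one common parameterization following Minka (2005).

```python
import math
import torch

def kl(p, q, eps=1e-8):
    """KL(p || q) summed over the class dimension."""
    return torch.sum(p * (torch.log(p + eps) - torch.log(q + eps)), dim=-1)

def jsd(p, q, eps=1e-8):
    """Jensen-Shannon distance: symmetric; base-2 logs bound it to [0, 1]."""
    m = 0.5 * (p + q)
    js = 0.5 * kl(p, m, eps) + 0.5 * kl(q, m, eps)  # JS divergence in nats
    return torch.sqrt(js / math.log(2.0))           # nats -> bits, then sqrt

def alpha_divergence(p, q, alpha=0.5, eps=1e-8):
    """Alpha-divergence family; interpolates between KL-like (alpha -> 1)
    and reverse-KL-like (alpha -> 0) behavior."""
    s = torch.sum((p + eps) ** alpha * (q + eps) ** (1.0 - alpha), dim=-1)
    return (1.0 - s) / (alpha * (1.0 - alpha))
```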

More Details
Compute Distance We compute the divergence of the model's predictive probabilities for each pair of an input and its augmentation, as in Eqn (1). Specifically, we use a pretrained BERT in classification tasks and GPT-2 in prompt-based few-shot learning as the base model p_θ to obtain the output probabilities for all unlabeled data points in D_pool. We then compute the divergence value with Eqn (1).

Rank and Select Candidates We apply these steps to all candidate examples from D_pool and obtain a divergence value for each. Our acquisition function selects the top s_acq examples with the highest divergence values to form the acquired batch T.
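Continuing the earlier sketches, ranking the pool and forming the batch T might look like the following hypothetical glue, reusing the `worst_case_divergence` scorer defined above:

```python
import torch

# Score every pooled example, then keep the s_acq examples with the
# highest divergence as the acquired batch T.
scores = torch.stack([worst_case_divergence(model, x, augment) for x in d_pool])
_, top_idx = torch.topk(scores, k=s_acq)
batch_t = [d_pool[i] for i in top_idx.tolist()]
```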
In prompt-based few-shot learning, we follow the same procedure to score and select in-context prompts.

Algorithm 1: Acquisition with Local Sensitivity and Hardness
1: Input: labeled data D_label, unlabeled data D_pool, acquisition size s_acq, model M with output probability p_θ(· | x).

Classification Task
We compare the proposed ALLSH against four baseline methods. We choose these baselines because they cover a spectrum of acquisition functions (uncertainty-based, batch-mode, and diversity-based).
Random samples data from the pool of unlabeled data D_pool following a uniform distribution. Entropy selects the s_acq sentences with the highest predictive entropy (Lewis and Gale, 1994). BADGE (Ash et al., 2020) combines uncertainty and diversity sampling by clustering gradient embeddings. CAL (Margatina et al., 2021) acquires contrastive examples that are similar in the model feature space yet yield divergent predictive likelihoods.

Implementation Details
For classification, we use BERT-base (Devlin et al., 2018) from the HuggingFace library (Wolf et al., 2020). We train all models with batch size 16, learning rate 2 × 10^−5, and the AdamW optimizer with epsilon 1 × 10^−8. For all datasets, we set the default annotation budget as 1%, the maximum annotation budget as 15%, the initial accumulated labeled set D_label as 0.1% of the whole unlabeled data, and the acquisition size as 50 instances per active learning iteration, following prior work (e.g., Gissin and Shalev-Shwartz, 2019; Dor et al., 2020; Ru et al., 2020).

Curriculum Learning (CL) We further combine our acquisition function with advances in semi-supervised learning (SSL) (Berthelot et al., 2019a; Sohn et al., 2020), which also integrates abundant unlabeled data into learning. A recent line of work in SSL utilizes data augmentations, such as TF-IDF and back-translation, to enforce local consistency of the model (Sajjadi et al., 2016; Miyato et al., 2018). Here SSL can further distill information from unlabeled data and gradually propagate label information from labeled examples to unlabeled ones during the training stage (Xie et al., 2020; Zhang et al., 2021c). We construct the overall loss function as

L = L_S + α L_U,   (5)

where L_S is the cross-entropy supervised learning loss over labeled samples, L_U is the consistency regularization term, and α is a coefficient (Tarvainen and Valpola, 2017; Berthelot et al., 2019b).

For prompt-based few-shot learning, we run experiments on the 1.5B-parameter GPT-2 (Radford et al., 2019), a Transformer-based (Vaswani et al., 2017) language model that largely follows the details of the OpenAI GPT model (Radford et al., 2018). We take TF-IDF as the default augmentation method and provide a rich analysis of other augmentation methods in Section 5. More detailed experimental settings are included in Appendix A.
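A minimal sketch of the combined objective in Eqn (5), assuming `model` returns logits and `augment` produces augmented copies of the unlabeled inputs; the stop-gradient on the clean view is a common consistency-training choice and an assumption here, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, labeled_x, labeled_y, unlabeled_x, augment, alpha=0.01):
    """L = L_S + alpha * L_U: supervised cross-entropy plus a consistency
    term pulling predictions on unlabeled examples and their augmentations
    together (alpha = 0.01 as in the experimental setup above)."""
    l_sup = F.cross_entropy(model(labeled_x), labeled_y)          # L_S
    with torch.no_grad():
        p = F.softmax(model(unlabeled_x), dim=-1)                 # clean view
    log_q = F.log_softmax(model(augment(unlabeled_x)), dim=-1)    # augmented view
    l_cons = F.kl_div(log_q, p, reduction="batchmean")            # L_U = KL(p || q)
    return l_sup + alpha * l_cons
```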

Experiments
We evaluate the performance of our acquisition and learning framework in this section. In tables, we bold the best result among Random, Entropy, BADGE, CAL, and the proposed ALLSH (Ours); we additionally bold the best result within each column block. All experimental results are obtained with five independent runs to determine the variance. See Appendix A for the full results with error bars.

In-Domain Classification Task Results
In Table 2, we evaluate the impact of our acquisition function under three different annotation budgets (1%, 5%, and 10%). With a constrained annotation budget, we see substantial gains in test accuracy with our proposed acquisition, ALLSH, and with selecting the worst-case augmentation. Encouraged by these initial results, we further explore our acquisition with curriculum learning. Across all settings, ALLSH is consistently the top-performing method, especially on SST-2, IMDB, and AG News. With a tight budget, our proposed acquisition can successfully integrate local sensitivity and learning difficulty to select data for annotation.
For BADGE, despite combining both uncertainty and diversity sampling, it only achieves comparable results on QNLI, suggesting that gradient computation may not directly benefit data acquisition. In addition, since it requires clustering of high-dimensional data, BADGE is computationally heavy, as its complexity grows exponentially with the acquisition size (Yuan et al., 2020). We provide a rich analysis of the sampling efficiency and running time of each method in Appendix A and include the results in Table 13. ALLSH also outperforms common uncertainty sampling in most cases: given the current model state, uncertainty sampling chooses samples that are not representative of the whole unlabeled pool, leading to ineffective sampling. CAL's contrastive acquisition is effective on QNLI. We hypothesize that due to the presence of lexical and syntactic ambiguity between sentence pairs, contrastive examples can be used to push apart the inputs in the feature space. Full in-domain results are reported in Table 11 in the Appendix.

Out-of-Domain Classification Task Results
We compare our proposed method with the baselines for their performance in an out-of-domain (OD) setting and summarize the results in Table 3. We test domain generalization on three datasets spanning two tasks: sentiment analysis and paraphrase detection. We set the annotation budget as 15% of D_pool for all OD experiments. For OD evaluation on SST-2 and IMDB, ALLSH yields better results than all baselines by a clear margin.

Prompt-Based Few-Shot Learning Results
We present the prompt-based few-shot learning results with GPT-2 in Table 4, in which we follow the settings (4-shot, 8-shot, and 12-shot) of Zhao et al. (2021). Few-shot learners are sensitive to the quality of labeled data (Sohn et al., 2020), and previous acquisition functions usually fail to outperform labeling randomly sampled data. In Table 4, we observe that uncertainty-selected prompts perform similarly to randomly selected prompts. A potential reason is that an under-trained model treats all examples as uncertain and can hardly distinguish their informativeness. In contrast, our proposed acquisition demonstrates a strong capability for modeling local sensitivity and learning from easy to hard, and it achieves the best performance in most of the settings. These findings show the potential of using our acquisition to improve prompt-based few-shot learning and to select good in-context examples for the GPT-2 model.
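As a hypothetical illustration of how the acquisition transfers to this setting, one could score a candidate prompt by the divergence of GPT-2's distribution over verbalizer tokens under paraphrasing. The helper names, the `paraphrase` function, and the verbalizer setup below are our assumptions, not the paper's released code; the HuggingFace-style calls (`tokenizer(...)`, `lm(ids).logits`) follow the standard GPT-2 interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def prompt_sensitivity(lm, tokenizer, prompt, paraphrase, label_token_ids):
    """Score a candidate prompt by the KL divergence between the model's
    next-token distributions over verbalizer tokens (e.g., " positive" /
    " negative") for the prompt and a paraphrased copy of it."""
    def label_dist(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        logits = lm(ids).logits[0, -1]                     # next-token logits
        return F.softmax(logits[label_token_ids], dim=-1)  # renormalize on labels
    p = label_dist(prompt)
    q = label_dist(paraphrase(prompt))
    return torch.sum(p * (torch.log(p) - torch.log(q)))    # higher => more sensitive
```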

Analysis
Can we use our proposed acquisition in the imbalanced setting? Extreme label imbalance is an important challenge in many non-pairwise NLP tasks (Sun et al., 2009; Zhang et al., 2017; Mussmann et al., 2020b). We set up the imbalanced setting by sampling a subset with a class-imbalanced sample rate. For binary classification, we set the positive-class data sample rate as 1.0 and the negative-class data sample rate as 0.

[Table 6: Acquisition performance for different augmentations. We report results of our acquisition with different augmentations used to generate the local copies of the samples.]
Back-translation paraphrases the inputs while preserving the semantics of the original sentences, and select-worst-case augments the inputs by incorporating approximate adversarial perturbations. Table 6 indicates our method is insensitive to the choice of augmentation. We also observe that WCA achieves the highest gains on two datasets, which confirms our discussion in Section 2.3 that selecting the worst case is capable of capturing local sensitivity.
What is the influence of the choice of divergence? We select different divergences from the statistical distance family and study their abilities to encode different information. Corresponding to Section 2.3, we present the results in Table 7. We experiment with the KL divergence, JSD, and the α-divergence (Minka et al., 2005) with the α value set to −0.5 or 0.5. We notice that in our case the differences among divergences are small. A possible reason is that the number of class categories is small, so the choice of divergence does not have a large influence.
Can we use the proposed acquisition with extremely few labeled data? We have presented results under very limited annotation budgets in Table 2. Here we set the annotation budget as 0.8% and 0.4%. The key observation is that the degradation in performance of the other acquisition functions is dramatic. For example, on IMDB, uncertainty sampling (Entropy) shows an obvious performance drop. It suffers from the sampling bias problem because of the frequent variation of the decision boundary in the early phase of training with very few labeled data available, which results in ineffective sampling. Even in this extreme case, our acquisition still manages to select the most informative examples for the model. This further corroborates our empirical results in Section 4.3 on prompt-based few-shot learning, where only a very few in-context prompts are provided.

Related Work
Active Learning Active learning has been widely used in many NLP applications (Lowell et al., 2018; Dor et al., 2020; Ru et al., 2020). Uncertainty-based methods (Fletcher et al., 2008) have become the most common strategy. Instead of only considering uncertainty, diversity sampling has emerged as an alternative direction; recent works (Geifman and El-Yaniv, 2017; Sener and Savarese, 2017; Ash et al., 2020; Yuan et al., 2020) focus on different aspects of diversity. The most recent works (e.g., Zhang and Plank, 2021; Margatina et al., 2021) exploit the model's behavior on each individual instance. Our work focuses more on the local sensitivity and informativeness of data, leading to better performance under various limited-annotation settings.
Annotation Budgeting Annotation budgeting for learning has long been studied (Turney, 2002). Sheng et al. (2008) study the tradeoff between collecting multiple labels per example versus annotating more examples. On the other hand, different labeling strategies have been studied, such as providing fine-grained rationales (Dua et al., 2020), active learning (Kirsch et al., 2019), and the training dynamics approach (Swayamdipta et al., 2020). Beyond standard classification, class-imbalanced (Mussmann et al., 2020a) and noisy-label settings (Fan et al., 2021; Chen et al., 2021) have also been explored. We utilize active learning to explore labeling strategies and aim to select the most informative data for annotation.

Conclusion
Our work demonstrates the benefits of introducing local sensitivity and learning from easy to hard into the acquisition strategy. The proposed acquisition function shows noticeable gains in performance across classification tasks and prompt-based few-shot learning. In this work, we conduct a detailed study of the proposed acquisition strategy in different settings, including imbalanced and extremely limited labels. We also verify the impact of different design choices, such as the choice of divergence and augmentations. To summarize, the proposed ALLSH is effective and general, with the potential to be incorporated into existing models for various NLP tasks.

References

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019b. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems.

Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. 2002. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems.

Derek Chen, Zhou Yu, and Samuel R. Bowman. 2021. Learning with noisy labels by targeted relabeling. arXiv preprint arXiv:2110.08355.

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In AAAI.

Stephen Mussmann, Robin Jia, and Percy Liang. 2020a. On the importance of adaptive data collection for extremely imbalanced pairwise tasks. In Conference on Empirical Methods in Natural Language Processing.

Stephen Mussmann, Robin Jia, and Percy Liang. 2020b. On the importance of adaptive data collection for extremely imbalanced pairwise tasks. arXiv preprint arXiv:2010.05103.

A Experimental Details

A.1 Full Results and Examples
We report the full results of out-of-domain and in-domain tasks in Tables 9 and 11, respectively. The full results of prompt-based few-shot learning are shown in Table 10, and Table 12 shows prompt examples for each task.

A.2 Classification Task Hyperparameters and Experimental Settings
Our implementation is based on BERT-base (Devlin et al., 2018) from HuggingFace Transformers (Wolf et al., 2020). We optimize the KL divergence objective with the Adam optimizer (Kingma and Ba, 2014), and the batch size is set to 16 for all experiments. The curriculum learning is trained for 200 iterations. The learning rate is 2 × 10^−5. The α in Eqn (5) is set to 0.01 for all experiments. For longer input texts, such as IMDB, we use a maximum sequence length of 256; for the others, we use 128. Following Ash et al. (2020) and Margatina et al. (2021), we begin the active learning loop by randomly sampling the initial training set D_label.