Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification

To obtain a large amount of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations, achieving competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework that alleviates the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1) to train the model with balanced data batches to reduce the data imbalance issue and (2) to exploit the expertise of each labeling rule for collecting clean samples. Experiments on four text classification datasets with four different imbalance ratios show that ARS2 outperformed state-of-the-art imbalanced learning and WS methods, leading to a 2%-57.8% improvement in F1-score.


Introduction
Deep learning models rely heavily on high-quality yet expensive labeled data. Owing to this considerable cost, the weak supervision (WS) paradigm has increasingly been used to reduce human effort (Ratner et al., 2016a; Zhang et al., 2021). This approach synthesizes training labels with labeling rules to significantly improve the efficiency of creating training sets, and has achieved competitive results in natural language processing (NLP) (Yu et al., 2020; Ren et al., 2020; Rühling Cachay et al., 2021). However, existing methods leveraging the WS paradigm for NLP tasks mostly focus on reducing the noise in training labels introduced by labeling rules, while ignoring the common and critical problem of data imbalance. In fact, in a preliminary experiment performed as part of the present work (Fig. 1), we found that the WS paradigm may amplify the imbalance ratio of the dataset, because the synthesized training labels tend to have a more imbalanced distribution.
To address this issue, we propose ARS2, a general model-agnostic framework based on the WS paradigm. ARS2 consists of two main stages: (1) warm-up, in which noisy data are used to train the model to obtain a noise detector; and (2) continual training with adaptive ranking-based sample selection. In the second stage, we use the noise detector trained during warm-up to evaluate the cleanliness of the data, and use the resulting ranking to sample the data. We followed previous works (Ratner et al., 2016a; Ren et al., 2020; Zhang et al., 2022b) in using heuristic programmatic rules to annotate the data. In weakly supervised learning, researchers use a label model to aggregate the weak labels annotated by rules and estimate the probabilistic class distribution of each data point. In this work, we use a label model to integrate the weak labels given by the rules as pseudo-labels during training, obviating the need for manual labeling.
To select the samples most likely to be clean, we adopt a selection strategy based on small loss, a common method that has been verified to be effective in many situations (Jiang et al., 2018; Yu et al., 2019; Yao et al., 2020). Specifically, deep neural networks, which have a strong memorization ability (Wu et al., 2018; Wei et al., 2021), first memorize the labels of clean data and only later those of noisy data, under the assumption that clean data form the majority of a noisy dataset. Data with small loss can thus be regarded as clean examples with high probability. Inspired by this approach, we propose the probabilistic margin score (PMS) as a criterion for judging whether data are clean. Instead of using the confidence given by a model directly, a confidence margin is used for better performance (Ye et al., 2020). We also performed a comparative experiment on the use of the margin versus the direct use of confidence, as described in Sec. 3.3.
Sample selection based on weak labels can lead to severe class imbalance. Consequently, models trained on these imbalanced subsets can exhibit superior performance on majority classes but inferior performance on minority classes (Cui et al., 2019). A reweighted loss function can partially address this problem. However, performance remains limited by noisy labels; that is, data with majority-class features may be incorrectly annotated as minority-class data, which misleads the training process. Therefore, we propose a sample selection strategy based on class-wise ranking (CR) to address imbalanced data. Using this strategy, we can select relatively balanced batches for training and avoid the strong influence of the majority class.
To further exploit the expertise of labeling rules, we also propose another sample selection strategy called rule-aware ranking (RR). The typical WS paradigm uses aggregated labels as pseudo-labels and discards the original weak labels. However, the annotations generated by rules are likely to contain a considerable amount of valid information; for example, some rules yield a high proportion of correct results. The higher the PMS, the more likely the labeling result of the rules is to be close to the ground truth. Using this strategy, we can select batches of clean data for training and avoid the influence of noise.
The primary contributions of this work are summarized as follows. (1) We propose a general, model-agnostic weakly supervised learning framework called ARS2 for imbalanced datasets; (2) we propose two reliable adaptive sampling strategies to address data imbalance issues; and (3) we present the results of experiments on four benchmark datasets, which demonstrate that ARS2 improved on the performance of existing imbalanced learning and weakly supervised learning methods by 2%-57.8% in terms of F1-score.
Weakly Supervised Class-imbalanced Text Classification

Problem Formulation
In this work, we study class-imbalanced text classification in a setting with weak supervision. Specifically, we consider an unlabeled dataset D consisting of N documents, each of which is denoted by x_i ∈ X. For each document x_i, the corresponding label y_i ∈ Y = {1, 2, ..., C} is unknown to us, whereas the class prior p(y) is given and highly imbalanced. Our goal is to learn a parameterized function f(·; θ) : X → Δ^{C−1} which outputs the class probability p(y | x) and can be used to classify documents during inference.
To address the lack of ground truth training labels, we adopt the two-stage weak supervision paradigm (Ratner et al., 2016b;Zhang et al., 2021).
In particular, we rely on k user-provided heuristic rules {r_i}_{i∈{1,...,k}} to provide weak labels. Each rule r_i is associated with a particular label y_{r_i} ∈ Y, and we denote by l_i the output of rule r_i. It either assigns the associated label (l_i = y_{r_i}) to a given document or abstains (l_i = −1) on this example. Note that the user-provided rules can be noisy and may conflict with one another. For a document x, we concatenate the output weak labels of the k rules, l_1, ..., l_k, as l_x. Throughout this work, we apply the weak labels output by heuristic rules to train a text classifier.

Aggregation of Weak Labels
Label models are used to aggregate weak labels under the weak supervision paradigm; the aggregated labels are in turn used to train the desired end model in the next stage. Existing label models include Majority Voting (MV), Probabilistic Graphical Models (PGM) (Dawid and Skene, 1979; Ratner et al., 2019b; Fu et al., 2020), etc. In this research, we use the PGM implemented by Ratner et al. (2019b) as our label model g(·), which treats l_x as a random variable. After modeling the relationship between the observed variable l_x and the unobserved variable y with Bayesian methods, the label model obtains the posterior distribution of y given l_x through an inference process such as expectation maximization or variational inference. Finally, we take the class that maximizes P(y | l_x) as the hard pseudo-label ỹ_x of x for later end-model training.
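As a concrete illustration of weak-label aggregation, the simplest label model, majority voting, can be sketched as below. This is a minimal example with a hypothetical `majority_vote` helper; the PGM label model actually used in this work additionally models rule accuracies and correlations and performs posterior inference rather than simple counting.

```python
import numpy as np

ABSTAIN = -1

def majority_vote(weak_labels, n_classes):
    """Aggregate per-rule weak labels into one hard pseudo-label per example.

    weak_labels: (N, k) array; entry [i, j] is rule j's label for example i,
    or ABSTAIN (-1) if the rule does not fire on that example.
    Ties and fully-abstained rows fall back to class 0 here for simplicity.
    """
    n = weak_labels.shape[0]
    pseudo = np.zeros(n, dtype=int)
    for i in range(n):
        votes = weak_labels[i][weak_labels[i] != ABSTAIN]
        if len(votes) > 0:
            counts = np.bincount(votes, minlength=n_classes)
            pseudo[i] = counts.argmax()
    return pseudo

# Three rules vote on two examples; rule 2 abstains on the first.
L = np.array([[1, 1, -1],
              [0, 2, 0]])
print(majority_vote(L, n_classes=3))  # [1 0]
```

In practice, a probabilistic label model replaces the raw vote counts with learned per-rule accuracies, which is why conflicting rules need not cancel out.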

Adaptive Ranking-based Sample Selection
We propose an adaptive ranking-based sample selection approach to simultaneously address the problems caused by data imbalance and by the noise introduced by programmatic rules. First, the classification model f(·; θ) is warmed up with the pseudo-labels ỹ_x, which are used to train the model as a noise indicator that can discriminate noise in the next stage. Then, we continue training the warmed-up model on data sampled by adaptive sampling strategies, namely class-wise ranking (CR) and rule-aware ranking (RR), both supported by the probabilistic margin score (PMS). The training procedure is summarized in Algorithm 1, and the proposed framework is illustrated in Figure 2.
Warm-up. Inspired by Zheng et al. (2020), the prediction of a noisy classification model can be a good indicator of whether the label of a training data point is clean. Our method relies on a noise indicator trained during warm-up to determine whether each training data point is clean. However, a model with sufficient capacity (e.g., more parameters than training examples) can "memorize" each example, overfitting the training set and yielding poor generalization to validation and test sets (Goodfellow et al., 2016). To prevent the model from overfitting noisy data, we warm up the model f(·; θ) with early stopping (Dodge et al., 2020), solving the optimization problem

min_θ Σ_{x∈D} L(f(x; θ), ỹ_x),     (2)

where L denotes a loss function and ỹ_x is the pseudo-label aggregated by the label model. In this research, we do not restrict the definition of the loss function; that is, any loss function suitable for a multi-class classification task can be used.
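The early-stopped warm-up can be sketched with a generic patience-based stopper; the patience value and the validation-loss criterion here are illustrative assumptions rather than details from the paper:

```python
class EarlyStopper:
    """Stop warm-up once the validation loss has not improved for
    `patience` consecutive evaluations, keeping f(.; theta) from
    memorizing noisy pseudo-labels."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:      # improvement: reset the counter
            self.best = val_loss
            self.bad_evals = 0
        else:                         # no improvement: count it
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
losses = [0.9, 0.7, 0.71, 0.72, 0.5]  # a real loop would break at the first True
stops = [stopper.should_stop(l) for l in losses]
print(stops)  # [False, False, False, True, False]
```

In a training loop one would call `should_stop` after each validation pass and break out of the warm-up epoch loop the first time it returns True.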

Continual training with sample selection.
Noisy and imbalanced labels impair the predictive performance of learning models, especially deep neural networks, which tend to exhibit a strong memorization capability (Wu et al., 2018; Wei et al., 2021). To reduce the influence of imbalance and noise on the classification model, we adopt the intuitive approach of simply continuing the training process with clean and balanced data. In the continual training phase, we filter for clean data using the warmed-up model and sample data in a balanced way to calibrate the model. This procedure provides more robust performance on a noisy, imbalanced dataset. To achieve this, we propose a measurement of label quality called the probabilistic margin score (PMS) (Sec. 2.4), together with two sample selection strategies: class-wise ranking (CR) (Sec. 2.5) and rule-aware ranking (RR) (Sec. 2.6). We adopt batch sampling using CR to form S and, after several steps, RR to form Q, and continue training the model on the batch U = Q ∪ S by minimizing the same loss as in the warm-up stage over U.
Dynamic-Rate Sampling. During the training procedure, the model learns from data with easy/frequent patterns and from data with harder/irregular patterns in separate training stages (Arpit et al., 2017). According to Chen et al. (2020), a small amount of clean data should be selected first, and a larger amount of clean data can be selected after the model has been trained well. Therefore, instead of fixing the number of selected data points, we linearly increase the size of the training batch U as the continual training stage proceeds, from B up to γ · B, where B denotes the batch size. The sampling ratio γ ranged from 1 to 10 in our experiments.
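Under one plausible reading of this schedule (the exact formula is an assumption, since the text only states that the batch grows linearly and that γ ranges from 1 to 10), the selected-batch size can be computed as:

```python
def dynamic_batch_size(step, total_steps, B, gamma):
    """Linearly grow the selected-batch size from B to gamma * B.

    step: current continual-training step in [0, total_steps].
    B: base batch size; gamma: sampling ratio (1 to 10 in the paper).
    """
    frac = step / max(total_steps, 1)          # progress in [0, 1]
    return int(round(B * (1 + (gamma - 1) * frac)))

sizes = [dynamic_batch_size(t, 4, B=32, gamma=3) for t in range(5)]
print(sizes)  # [32, 48, 64, 80, 96]
```

With γ = 1 the schedule degenerates to a constant batch size B, matching the lower end of the range studied in the hyperparameter experiments.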

Probabilistic Margin Score
To measure label quality, inspired by Pleiss et al. (2020), we use the margin of the output predictions to identify data with low label quality for later adaptive ranking-based sample selection. The predictions of the network can be regarded as a measurement of label quality: a correct prediction indicates high label quality, and vice versa (Jiang et al., 2018; Yu et al., 2019; Yao et al., 2020). Moreover, margins are advantageous because they are simple and efficient to compute during training, and they naturally factorize across samples, which makes it possible to estimate the contribution of each data point to label quality.
Based on this idea, we propose the probabilistic margin score (PMS) to reflect the quality of a given label. PMS is formulated as

s(x) = f_{ỹ_x}(x; θ) − max_{y ≠ ỹ_x} f_y(x; θ),

where f_y(·; θ) denotes the predicted probability of y by the classification model f(·; θ), and s(x) ∈ [−1, 1]. A negative margin corresponds to an incorrect prediction, and vice versa. The margin captures how much larger the (potentially incorrect) assigned confidence is than all other confidences, which indicates the label quality of x.
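A minimal sketch of PMS over a batch of softmax outputs might look as follows (the `pms` helper and its toy inputs are illustrative names, not code from the paper):

```python
import numpy as np

def pms(probs, pseudo_labels):
    """Probabilistic margin score: the confidence assigned to the
    pseudo-label minus the largest confidence among all other classes.

    probs: (N, C) softmax outputs of f(.; theta).
    pseudo_labels: (N,) integer pseudo-labels from the label model.
    Returns scores in [-1, 1]; a negative score means the model
    disagrees with the pseudo-label, suggesting a noisy example.
    """
    n = probs.shape[0]
    assigned = probs[np.arange(n), pseudo_labels]
    masked = probs.copy()
    masked[np.arange(n), pseudo_labels] = -np.inf  # exclude assigned class
    return assigned - masked.max(axis=1)

P = np.array([[0.7, 0.2, 0.1],    # agrees with pseudo-label 0
              [0.1, 0.3, 0.6]])   # disagrees with pseudo-label 1
print(np.round(pms(P, np.array([0, 1])), 3))  # [ 0.5 -0.3]
```

Because PMS depends only on the forward pass, it can be recomputed cheaply at every selection step as the model improves.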

Class-wise Ranking
If the top-ranked data were directly selected based on PMS to form the training batch, the majority-class data would be highly likely to dominate the selected batch. This class imbalance results in suboptimal performance. To overcome this problem, we propose class-wise ranking (CR). Specifically, instead of naively selecting the top-k data ranked by PMS from the entire training set, we select top-ranked data from each class to maintain the class balance of the resulting batch. However, CR might introduce more noise if we attempted to construct a perfectly class-balanced data batch. Thus, to reduce the noise while ensuring class balance as much as possible, we set a threshold on PMS to filter out noisy data. We assume that the set of data belonging to class y_i is X_{y_i}, and at step t we extract the top-k ranked data from each X_{y_i} as S^{(t)}_{y_i}, according to the ranking result from s(X). We also set a threshold ξ on s(X) to filter noisy data.
Then, we take the union of the S^{(t)}_{y_i} drawn from each class as a new batch S^{(t)}.
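Putting the threshold ξ and the per-class top-k together, CR can be sketched as below (a simplified illustration; function and variable names are our own):

```python
import numpy as np

def class_wise_ranking(scores, pseudo_labels, k, xi):
    """Select up to top-k examples per pseudo-class by PMS, keeping only
    those above the noise threshold xi, and union them into one batch.

    scores: (N,) PMS values; pseudo_labels: (N,) ints.
    Returns sorted indices of the selected batch S^(t).
    """
    selected = []
    for c in np.unique(pseudo_labels):
        # Examples of class c that survive the threshold xi.
        idx = np.where((pseudo_labels == c) & (scores > xi))[0]
        # Rank the survivors by descending PMS and keep the top k.
        top = idx[np.argsort(-scores[idx])][:k]
        selected.extend(top.tolist())
    return sorted(selected)

s = np.array([0.9, 0.2, -0.4, 0.8, 0.5, 0.1])
y = np.array([0,   0,    0,   1,   1,   1])
print(class_wise_ranking(s, y, k=2, xi=0.0))  # [0, 1, 3, 4]
```

Note that the batch is only approximately balanced: a class whose examples all fall below ξ contributes fewer than k samples, which is exactly the noise/balance trade-off the threshold is meant to control.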

Rule-aware Ranking
As discussed above, the training labels are synthesized from multiple labeling rules. In the proposed framework, these labeling rules are used not only to produce training labels, as in a typical WS pipeline, but also to guide sample selection. Specifically, each labeling rule typically assigns a certain label, e.g., sports, to only a part of the training set based on its expertise. Within this part of the dataset, some of the data must actually belong to the class sports; otherwise, the labeling rule would be useless. Thus, we propose rule-aware ranking (RR) to separately rank the data covered by each labeling rule and then extract the top-ranked data from each individual ranking. We elaborate on this selection process below.
We denote the set of data covered by labeling rule r_i as X_{r_i}. We then rank X_{r_i} based on the PMS s(x) and extract the top-k data from each X_{r_i} as Q^{(t)}_{r_i}. Specifically, to avoid error propagation, we use the weak label l_x instead of ỹ_x as the new label of x ∈ Q^{(t)}_{r_i} when l_x is unipolar. We claim that data with high confidence also have a high probability that their weak labels equal the ground-truth labels, especially when multiple rules assign the same weak label to the data. The main reason is that these weak labels are not influenced by the training process and are weakly supervised at the human level. Then, we take the union of the Q^{(t)}_{r_i} drawn from each subset as a new batch Q^{(t)}.
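A simplified sketch of RR, including the unipolar weak-label substitution, might look as follows (the function name and the dict-based return are our own choices, and "unipolar" is read here as "all firing rules agree"):

```python
import numpy as np

ABSTAIN = -1

def rule_aware_ranking(scores, weak_labels, k):
    """For each rule, rank the examples it covers by PMS and take the top-k.

    scores: (N,) PMS values; weak_labels: (N, n_rules) with -1 for abstain.
    Returns a dict mapping selected example index -> label to train with.
    When the firing rules are unipolar (all agree), their weak label is
    used instead of the aggregated pseudo-label to avoid error propagation.
    """
    batch = {}
    n_rules = weak_labels.shape[1]
    for j in range(n_rules):
        covered = np.where(weak_labels[:, j] != ABSTAIN)[0]
        top = covered[np.argsort(-scores[covered])][:k]
        for i in top:
            fired = weak_labels[i][weak_labels[i] != ABSTAIN]
            if len(set(fired.tolist())) == 1:  # unipolar: rules agree
                batch[int(i)] = int(fired[0])
    return batch

s = np.array([0.9, 0.4, 0.7])
L = np.array([[1, -1],    # only rule 0 fires: unipolar, label 1
              [0,  2],    # rules disagree: skipped
              [-1, 2]])   # only rule 1 fires: unipolar, label 2
print(rule_aware_ranking(s, L, k=2))  # {0: 1, 2: 2}
```

Examples with conflicting rule votes are simply skipped here; a fuller implementation could fall back to the aggregated pseudo-label ỹ_x for those cases.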
Experiment

Implementation Details. We choose a Multi-Layer Perceptron (MLP) with 2 hidden layers and RoBERTa (Liu et al., 2019) as the backbone models for our method and all baselines. When using RoBERTa as the backbone model, rather than training RoBERTa as a noise indicator in the warm-up stage, we used a warmed-up MLP as the noise indicator to sample training batches for RoBERTa, because a large model may easily overfit noisy data within a few steps. We used the classification macro-average F1 score on the test set as the evaluation metric for all datasets. We implemented our method using PyTorch (Paszke et al., 2019) with the WRENCH code-base (Zhang et al., 2021).

Baselines
Imbalance Learning Methods: (1) Logit Adjustment (LA) (Menon et al., 2020): This method adjusts the model's logits post hoc using the class prior to counteract the imbalanced label distribution. (2) Effective Number (Cui et al., 2019): This method re-weights the loss class-wise by the effective number of samples per class.

Table 2: F1-macro result on 2-layer MLP. Comparison among imbalance learning methods, weak supervision methods, and ARS2 (as well as its variants). The ranking using confidence is identical to the ranking by negative loss. Note that ARS2 outperforms all baselines on all imbalanced datasets.
(3) Dice (Li et al., 2019): The dice loss method uses dynamically adjusted weights to improve the Sørensen-Dice coefficient (a measure that values both false positives and false negatives and works well for imbalanced datasets), which reduces the impact of easy negative data during training. (4) LDAM (Cao et al., 2019a): This method encourages the model to address the optimal trade-off between per-class margins. (5) Effective Number + Logit Adjustment: Because the effective number is a class-wise re-weighting method, we also combined it with logit adjustment as a baseline.
Weak Supervision Methods: (1) COSINE (Yu et al., 2020): The COSINE method uses weakly labeled data to fine-tune pre-trained language models via contrastive self-training.
(2) Denoise (Ren et al., 2020): This method estimates source reliability using a conditional soft attention mechanism and then reduces label noise by aggregating the weak labels annotated by a set of rules.

Main Result
Our main results on a 2-layer MLP model are reported in Table 2. Our method outperformed all the baselines on both balanced and imbalanced datasets. The results of the experiment also indicated the following.
The performance of all baseline methods generally showed a downward trend with increasing imbalance ratio, and the decline was more evident on the binary classification problem of Yelp. In contrast, on the multi-class dataset AGNews, although the head class and the tail class exhibited a relatively large gap because the size of each class declined step by step, the ratio between the head class and the second head class did not reach the value of the imbalance ratio. Therefore, the model's performance decline on multi-class classification was slightly smaller than on binary classification.
On the balanced dataset, the performance of ARS2 was very similar to that of using RR only, whereas the performance of ARS2 (w/o CR) was slightly better than that of ARS2 on AGNews. This occurred because CR did not operate as intended on the balanced dataset but instead affected the diversity of the data, resulting in a small gap between ARS2 and RR on the balanced set. The role of CR became gradually more prominent with increasing imbalance ratio, because CR was able to balance the originally unbalanced sample set such that the model was not affected by the data from the majority class, improving performance.
The baseline performance gradually weakened with increasing imbalance ratio. These baselines were all loss-modification-related methods. In a noisy dataset, if a data point originally belonging to the head class is incorrectly marked as belonging to a tail class, the loss modification causes the model to assign a larger weight to data that may be noisy, which deepens the error of the model and results in poor training performance.

Ablation Study
We also examined the importance of CR, RR, and different measurements of label quality. The 8th, 9th and 11th rows of Table 2 summarize the results. It may be observed that the performance of ARS2 without CR on the balanced set was almost the same as that of ARS2, which shows that RR was the main contributor on the balanced set. However, the contributions of both start to converge with increasing imbalance ratio. We also compared the performance of the two ranking scores. The 11th and 12th rows of Table 2 show that ARS2 with PMS performed better than with confidence, except on the Yelp dataset with an imbalance ratio of 1. However, the difference in performance in this situation was less than 1%. This shows that PMS reflects the cleanliness of data better, and thus may be considered the more useful ranking score.

Hyperparameter Study
We studied two hyperparameters, γ and ξ, used to control the quality of adaptive sampling. We set the linear ratio from 1 to 10, where higher values indicate larger amounts of data in the sampled batch. We also set the threshold from -0.5 to 0.5, which controls the quality of the data in our sampled batch. Figure 5(a) shows that ARS2 is insensitive to γ, as the performance variation is less than 0.8%. In the case of Yelp, ARS2 cannot perform well with a small threshold (less than 0), because the loose threshold may admit a large amount of noisy data that affects the training process. In the case of AGNews, the high quality of the rules helps RR to re-correct the noisy labels, increasing the number of clean data points accessed by the model during training. The results also show that there was noise in the judgment of the rules, and that using RR alone could not eliminate the noisy weak labels; rather, it further increased the noise of the data. ARS2 generally outperforms a single strategy with increasing values of the imbalance ratio, because the model learns the features of clean data and uses these features to filter out more clean data, gradually improving the cleanliness of the dataset and improving training performance.

Performance on Fine-tuning Pre-trained Model
From Table 3 it may be observed that our optimized RoBERTa training process was able to achieve state-of-the-art results. On the Yelp dataset with ρ = 50, none of the baselines could fit well, but the data filtered by our method enabled RoBERTa to fit well even under such extreme conditions. This is because the fine-tuning process of RoBERTa only requires a small amount of data. Hence, we used the warmed-up MLP to dynamically choose a small amount of clean and balanced data for training, according to the batch size of RoBERTa, to achieve good results. When training RoBERTa, we also linearly increased the training set to expose the model to a greater diversity of data.

Related Works
Weak Supervision. Weak supervision aims to reduce the cost of annotation, and has been widely applied to both classification (Ratner et al., 2016b, 2019a; Fu et al., 2020; Yu et al., 2020; Ren et al., 2020) and sequence tagging (Lison et al., 2020; Nguyen et al., 2017; Safranchik et al., 2020; Li et al., 2021; Lan et al., 2020), helping to reduce the human labor required for annotation. Weak supervision builds on many previous approaches in machine learning, such as distant supervision (Mintz et al., 2009; Hoffmann et al., 2011; Takamatsu et al., 2012), crowdsourcing (Gao et al., 2011; Krishna et al., 2016), co-training methods (Blum and Mitchell, 1998), pattern-based supervision (Gupta and Manning, 2014), and feature annotation (Mann and McCallum, 2010; Zaidan and Eisner, 2008). Specifically, weak supervision methods take multiple noisy supervision sources and an unlabeled dataset as input, aiming to generate training labels to train an end model (two-stage methods) or to directly produce the end model for the downstream task (single-stage methods) without any manual annotation. Interested readers are referred to a recent survey (Zhang et al., 2022a) for a review of the literature on weak supervision.
Class Imbalance Learning. Four primary families of methods have been proposed to solve the data imbalance problem: post-hoc correction, loss weighting, data modification, and margin modification. Post-hoc correction modifies the logits computed by the model using the data prior to bias the training process towards the classes with fewer data (Fawcett and Provost, 1996; Provost, 2000; Maloof, 2003; King and Zeng, 2001; Collell et al., 2016; Kim and Kim, 2020; Kang et al., 2019).
Loss weighting weights the loss by the prior distribution of the training set, so that the data of the minority class exhibit a larger loss and the model learns more bias towards the minority class, balancing the minority and majority classes (Xie and Manski, 1989; Morik et al., 1999; Menon et al., 2013; Cui et al., 2019; Fan et al., 2017). Data modification balances a training set by increasing the number of data samples of minority classes or decreasing the data of majority classes, so that the model is trained without favoring the majority classes or ignoring the minority classes (Kubat et al., 1997; Wallace et al., 2011; Chawla et al., 2002). Margin modification balances the minority and majority classes by increasing the margin between minority-class and majority-class data, which makes it easier for a model to learn a discriminative, robust decision boundary between them (Masnadi-Shirazi and Vasconcelos, 2010; Iranmehr et al., 2019; Zhang et al., 2017; Cao et al., 2019b; Tan et al., 2020). When training on a noisy training set, adding weights or adding data may give incorrect results, causing the model to learn to be biased towards the noise.

Conclusion
In this article, we have proposed ARS2, a method based on the WS paradigm that reduces both noise and the impact of natural imbalance in data. We have proposed PMS to evaluate the level of noise in training data. To reduce the impact of noisy data on training, we have proposed two ranking strategies based on PMS: CR and RR. Finally, adaptive sampling is performed on the data based on these rankings to clean the data. We have also presented experimental results on eight different datasets, which demonstrate that ARS2 outperformed traditional WS and loss modification methods.

Limitation
First, because the presented work is focused on adaptive clean-data sampling, we used the MLP as a teacher model for a large language model like RoBERTa. In future research, co-teaching methods (Han et al., 2018), which provide a more efficient teacher-student structure for training large models, could be considered to further improve the efficiency of the teacher model. Also, due to the lack of computational resources, we only used a 2-layer MLP and RoBERTa as our backbone models. A larger language model like RoBERTa-large (Liu et al., 2019) could be considered in future work. Finally, the proposed approach could be extended to other tasks such as sequence labeling or natural language inference.

Figure 1: Comparison of class distribution between the ground truth labels and labels produced by weak supervision (WS) on TREC dataset.The uncovered piece represents the data not covered by any labeling rule in WS.It may be observed that WS amplified the class imbalance.

Figure 2: Overview of ARS2.Our framework has two stages: (1) warm-up, which is used to let the model learn how to distinguish noisy data; (2) continual training with adaptive sampling, which is used to sample clean data.We adopt two different adaptive sampling strategies, including class-wise ranking sampling and rule-aware ranking sampling.

Figure 3: The quality of the selected data in AGNews under each imbalance ratio.

Figure 3 and Figure 4 illustrate the cleanliness of the selected data from AGNews and Yelp at each step. It may be observed that ARS2 performed similarly to ARS2 (w/o RR) on the balanced set, which means that combining CR and RR can address the problem of decreasing cleanliness caused by RR.

Table 3: F1-macro result on RoBERTa. Comparison among imbalance learning methods, weak supervision methods, and ARS2 (as well as its variants). Note that ARS2 outperforms all baselines on all imbalanced datasets.