ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision

A cost-effective alternative to manual data labeling is weak supervision (WS), where data samples are automatically annotated using a predefined set of labeling functions (LFs), rule-based mechanisms that generate artificial labels for the associated classes. In this work, we investigate noise reduction techniques for WS based on the principle of k-fold cross-validation. We introduce a new algorithm ULF for Unsupervised Labeling Function correction, which denoises WS data by leveraging models trained on all but some LFs to identify and correct biases specific to the held-out LFs. Specifically, ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples. Evaluation on multiple datasets confirms ULF's effectiveness in enhancing WS learning without the need for manual labeling.


Introduction
A large part of today's machine learning success rests upon vast amounts of annotated training data. However, manual expert annotation is tedious and expensive work. There are different approaches to reducing this data bottleneck: fine-tuning large pre-trained models (Devlin et al., 2019), applying active learning (Sun and Grishman, 2012), and semi-supervised learning (Kozareva et al., 2008). However, even if to a reduced extent, these approaches still demand manually annotated data, which can be a tremendous challenge in some cases, such as tasks with dynamically changing requirements or low-resource languages.
Another strategy that does not require any expert manual annotation is weak supervision (WS), which allows obtaining massive amounts of training data at a low cost. In a weakly supervised setting, the data is annotated in an automated process using one or multiple weak supervision sources: for example, external knowledge bases (Lin et al., 2016; Mintz et al., 2009) and manually defined or automatically generated heuristics (Ratner et al., 2020; Varma and Ré, 2018). By mapping such rules, or labeling functions (LFs, Ratner et al., 2020), to a large unlabeled dataset, one can quickly obtain weak training labels, which are, however, potentially error-prone and need additional denoising. Examples from the YouTube Spam dataset (Alberto et al., 2015) are provided in Figure 1. In (1), both matched LFs correspond to the SPAM class, so the sample is assigned to this class as well. In (2), one of the matched LFs belongs to the SPAM class, while the other belongs to the HAM class; the easiest way to break the tie is to assign the sample to one of the two classes randomly. In (3), no labeling function matched, meaning the sample does not get any label and may be either filtered out or assigned to a random class.
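The three matching cases from Figure 1 can be sketched in a few lines. The keyword LFs below are illustrative stand-ins of our own, not the actual LFs of the cited datasets:

```python
import random

# Hypothetical keyword-based labeling functions for a SPAM/HAM task
# (illustrative only; real LFs come from the datasets cited above).
SPAM, HAM = 0, 1
LFS = [("subscribe", SPAM), ("my", SPAM), ("song", HAM)]

def weak_label(text, seed=0):
    """Apply all matching LFs; majority vote with random tie-breaking."""
    votes = [label for kw, label in LFS if kw in text.lower()]
    if not votes:                       # case (3): no LF matched
        return None
    counts = {c: votes.count(c) for c in set(votes)}
    best = max(counts.values())
    tied = [c for c, n in counts.items() if n == best]
    return random.Random(seed).choice(tied)  # case (2): random tie-break

print(weak_label("Subscribe to my channel!"))  # both LFs vote SPAM -> 0
print(weak_label("check out this song"))       # single HAM LF -> 1
print(weak_label("great video"))               # no LF matched -> None
```

Case (1) corresponds to the first call (all votes agree), case (2) to a tie between classes, and case (3) to the abstaining `None` result.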
In this work, we explore methods for improving the quality of weakly supervised data based on the principle of k-fold cross-validation. The intuition behind them is the following: a model trained on a substantial part of the data samples and the corresponding weak labels can predict more reliable and accurate labels for the rest of the data than the weak ones. In this work, we denote such labels as out-of-sample labels, since they are calculated based on the other data samples.
The earlier approaches to data cleaning, which follow this idea, deal with general, non-weakly supervised data and, therefore, split the data samples into folds randomly (Northcutt et al., 2021;Wan et al., 2019). However, a direct application of these methods to the weakly labeled data ignores valuable knowledge stemming from the weak supervision process: for example, which LFs matched in each sample and what class each LF corresponds to. In this work, we propose extensions for these methods that leverage this additional source of knowledge via splitting the data considering the labeling functions matched in samples.
Apart from that, we propose ULF, a new method for Unsupervised Labeling Function correction with k-fold cross-validation. Its primary goal is to improve the LFs-to-classes allocation in order to correct systematically biased label assignments. ULF calculates the LFs confident matrix and estimates the joint distribution between LFs and output labels using the predicted class probabilities and the original weak labels. The improved allocation allows calculating the labels anew and applying them in further training. Importantly, ULF also profits from the samples with no LFs matched, in contrast to other methods that filter them out (Ratner et al., 2020), and improves their labels as well.
Overall, our contributions are:
• We propose extensions of two methods originally created for denoising data with k-fold cross-validation based methods: CrossWeigh (Wang et al., 2019b) and Cleanlab (Northcutt et al., 2021). Our extensions, Weakly Supervised CrossWeigh (WSCW) and Weakly Supervised Cleanlab (WSCL), profit from WS-specific information and make the denoising of WS data more accurate.
• We propose our new method ULF for improving the LFs-to-classes allocation in an unsupervised fashion. ULF not only detects erroneous LFs-to-classes allocations but also corrects them, which leads to more accurate labels and a better quality of the trained classifier.
• We demonstrate the effectiveness of both proposed extensions and of the ULF method compared to the original methods and other baselines on several weakly supervised datasets.
To the best of our knowledge, we are the first (1) to adapt k-fold cross-validation based noise detection methods to the WS domain, and (2) to refine the LFs-to-classes allocation in the WS setting.

Related Work
Weak supervision has been widely applied to different tasks in various domains, such as text classification, relation extraction (Hoffmann et al., 2011), named entity recognition (Lan et al., 2020; Wang et al., 2019b), video analysis (Fang et al., 2020; Kundu et al., 2019), the medical domain (Fries et al., 2021; Saab et al., 2019), image classification (Li et al., 2021), and others. Weak labels are usually cheap and easy to obtain, but also potentially error-prone and often in need of additional denoising.
Some denoising algorithms build a specific model architecture or correct the loss function (Karamanolakis et al., 2021; Hedderich and Klakow, 2018; Patrini et al., 2017; Goldberger and Ben-Reuven, 2017; Sukhbaatar et al., 2014; Mnih and Hinton, 2012). Others profit from expert annotations: for example, by adding a set of manually annotated data to the weakly labeled data (Mazzetto et al., 2021; Karamanolakis et al., 2021; Awasthi et al., 2020; Maheshwari et al., 2021; Teljstedt et al., 2015), or by learning from manual user corrections (Hedderich et al., 2021; Boecking et al., 2021; Saito and Imamura, 2009). All methods that we introduce in this paper, on the contrary, do not require any manual supervision and can be used with any classifier. Works in another popular research direction aim at finding erroneous labels by estimating the relation between noisy and clean labels (Lange et al., 2019; Northcutt et al., 2021). However, quite often they do not provide any explicit tools for label correction (in contrast to our method ULF).
There is also a group of approaches that share the idea of using k-fold cross-validation for denoising manually labeled data (Wang et al., 2019a,b; Northcutt et al., 2021; Teljstedt et al., 2015). Wang et al. (2019b) detect mistakes in crowdsourced annotations by training k models, each on k − 1 data folds, with one fold left out for making predictions with the trained model. The samples where the original noisy labels disagree with the predicted ones are downweighted in further training as unreliable. Northcutt et al. (2021) also make use of a cross-validation approach, but, instead of predicting labels for the hold-out fold, they consider the confidence of the predicted class probabilities. The mislabeled samples are then identified by ranking with respect to self-confidence class-dependent thresholds and pruned. Both of these methods can be applied to any data; however, they miss a lot of potentially profitable information when used as-is for denoising weakly supervised data.

ULF: Unsupervised Labeling Function Correction
In this section, we present a novel approach, ULF: an Unsupervised Labeling Functions Correction algorithm. Like both CrossWeigh (Wang et al., 2019b) and Cleanlab (Northcutt et al., 2021), it exploits the idea of k-fold cross-validation training. However, while those algorithms only detect unreliable annotations, ULF corrects them by refining the LFs-to-classes allocation.

Motivation
Inaccurate and erroneous allocation of LFs to classes often becomes a cause of noise in the weakly supervised setting. For example, in Figure 1, among the LFs used to annotate the YouTube dataset, there is an LF "my" that corresponds to the SPAM class. The reason is the frequently encountered spam messages like "subscribe to my channel" or "follow my page". However, such a correspondence is by no means clear-cut and may sometimes be misleading. Thus, a label for Sample 2, where this LF matched alongside another one from class HAM, cannot be clearly defined by the LFs, as both classes gain 50% probability, which leads to random label assignment. However, if the information about "my" being related to the HAM class as well were considered, the overall HAM probability would dominate, and the sample would be classified into the correct HAM class.

Preliminaries
Let us introduce some formal notation used in the current and following sections. Given a dataset X with N data samples, X = {x_1, x_2, ..., x_N}; each sample is a set of words (i.e., one or several sentences). This set is used for training a classifier with K output classes, K = {k_1, k_2, ..., k_K}. In the weakly supervised setting, there are no known-to-be-correct true labels for training samples. Instead, we are provided with a set of L labeling functions (LFs), L = {l_1, l_2, ..., l_L}. We say that an LF l_j matches a sample x_i if it maps this sample to a label. For example, in the case of keyword-based LFs, an LF matches a sample if the keyword is present in it and, thus, the corresponding label is assigned to this data sample. The set of LFs matched in a sample x_i is denoted as L_{x_i}. In each x_i there can be one LF (|L_{x_i}| = 1), several LFs (|L_{x_i}| > 1), or no LFs (|L_{x_i}| = 0) matched. This information can be saved in a binary matrix Z_{N×L}, where Z_ij = 1 means that l_j matches in sample x_i. Each labeling function l_j corresponds to some class k_i (in other words, the labeling function l_j assigns the corresponding label to samples where l_j matches). The information about this correspondence is stored in a binary matrix T_{L×K}, where T_ij = 1 means that LF l_i corresponds to class k_j. By multiplying Z and T, applying majority voting, and breaking ties randomly, we obtain the potentially noisy labels Y = {y_1, y_2, ..., y_N}, y_j ∈ K, which can be used for training a classifier with parameters θ.
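The label construction from Z and T described above can be sketched as follows (the toy matrices are our own illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N=4 samples, L=3 LFs, K=2 classes (all values illustrative).
Z = np.array([[1, 1, 0],   # sample 1: LFs 1 and 2 matched
              [0, 1, 1],   # sample 2: LFs 2 and 3 matched (cross-class tie)
              [0, 0, 1],
              [0, 0, 0]])  # sample 4: no LF matched
T = np.array([[1, 0],      # LF 1 -> class 0
              [1, 0],      # LF 2 -> class 0
              [0, 1]])     # LF 3 -> class 1

votes = Z @ T              # per-sample class vote counts, shape (N, K)

def majority(votes_row):
    """Majority vote with random assignment for ties and abstains."""
    if votes_row.sum() == 0:                      # no LF matched
        return rng.integers(votes_row.shape[0])   # random class
    winners = np.flatnonzero(votes_row == votes_row.max())
    return rng.choice(winners)                    # random tie-break

Y = np.array([majority(v) for v in votes])
print(votes)
print(Y)
```

Samples 1 and 3 get deterministic labels (classes 0 and 1), while the tie in sample 2 and the abstain in sample 4 are resolved randomly, as in the majority-voting scheme above.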
An illustration of the Z, T, and Y matrices is presented in Figure 3 (we are not considering the denoised T̂ matrix yet).

Overview
Following the notation defined in Section 3.2, the main goal of the ULF algorithm is to improve the matrix T of LFs-to-classes assignments. This is done by, firstly, calculating reliable probabilities of training samples belonging to each class with k-fold cross-validation and, secondly, building an LFs-to-classes confidence matrix C_{L×K}. Combining this confidence matrix, which reflects the out-of-sample predicted LFs-to-classes assignment, with the original matrix T results in an adjusted matrix T̂, which can be used to calculate new, more accurate training labels Y. A graphical illustration is provided in Figure 3.

Label Probabilities with Cross-Validation
Firstly, the class probabilities for each training sample are predicted by k-fold cross-validation. For that, we use the unlabeled training set X and the weak labels Y obtained by multiplying the Z and T matrices.
There are three possible ways of data splitting:
• randomly (ULF_rndm): assigning the samples to folds the same way as in standard k-fold cross-validation, not considering which LFs match;
• by labeling functions (ULF_lfs): the same way it is done in WSCW (refer to Section 4.1 for more details);
• by signatures (ULF_sgn): for each training sample x_i, we take the set of matching LFs L_{x_i} as its signature. The k folds {f_1, ..., f_k} then contain signatures and not LFs as in WSCW. The signatures are split into k folds, each of which becomes a test fold in turn, while the others build the training folds.
After training k models separately on X_train_i, i ∈ [1, k], and making predictions on the hold-out folds X_test_i, we obtain a matrix with out-of-sample predicted probabilities P̂_{N×K}. From these probabilities the reliable labels ŷ are derived using the expected average thresholds t_j (here, we followed the approach of Northcutt et al., 2021):

ŷ_i = argmax_{j ∈ K: p̂(y = j; x_i, θ) ≥ t_j} p̂(y = j; x_i, θ), where t_j = (1 / |X_{y=j}|) Σ_{x ∈ X_{y=j}} p̂(y = j; x, θ).
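The out-of-sample probability and thresholding step can be sketched as below, a minimal implementation of the random-split variant (ULF_rndm), assuming scikit-learn's logistic regression as the classifier (the paper's experiments use logistic regression; the function name is our own):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def confident_labels(X, y_weak, k=5, seed=0):
    """Out-of-sample class probabilities plus Cleanlab-style per-class
    average thresholds; random fold split (the ULF_rndm variant)."""
    clf = LogisticRegression(max_iter=1000, random_state=seed)
    # P[i, j] = out-of-sample p(y=j | x_i), predicted by a model that
    # never saw x_i during training.
    P = cross_val_predict(clf, X, y_weak, cv=k, method="predict_proba")
    K = P.shape[1]
    # t_j: expected self-confidence = mean p(y=j) over samples weakly labeled j
    t = np.array([P[y_weak == j, j].mean() for j in range(K)])
    above = P >= t                     # which classes pass their threshold
    y_hat = np.where(above.any(axis=1),
                     np.argmax(P * above, axis=1),  # most probable confident class
                     -1)               # -1: no confident label
    return P, t, y_hat
```

Samples returning `-1` have no class probability above its threshold and thus receive no reliable label in this step.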

Re-estimate Labeling Functions
In order to refine the LFs-to-classes allocation, we estimate the joint distribution between matched LFs and predicted labels. This estimate later adjusts the initial LFs-to-classes correspondence stored in the T matrix. For each LF l_i, we calculate the number of samples where this LF matched and which were confidently assigned to each class k_j. This information is saved in an LFs confident matrix C_{L×K}:

C_{i,j} = |{x ∈ X : l_i ∈ L_x, ŷ_x = k_j}|.

Next, the confident matrix is calibrated and normalized to Q̂_{L×K} so that it corresponds with the values in the Z matrix: the calibrated matrix should sum up to the overall number of training samples, and the sum of counts for each LF should be the same as in the original Z matrix:

Q̂_{i,j} = (C_{i,j} / Σ_{j ∈ K} C_{i,j}) · |{x : l_i ∈ L_x}|, where i ∈ L,

normalized so that Σ_{i,j} Q̂_{i,j} = N. The joint matrix Q̂_{l,ŷ} can now be used for tuning the LFs-to-classes matrix T that contains the initial LFs-to-classes allocations. T and Q̂_{l,ŷ} are summed with multiplying coefficients p and 1 − p, where p ∈ [0, 1]. The value of p determines how much information from the estimated assignment matrix Q̂_{l,ŷ} should be preserved in the refined matrix T̂.
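A minimal sketch of this re-estimation step, under our reading of the text: we normalize T onto the same scale as Q̂ before blending, an assumption not stated explicitly above, and the calibration follows Cleanlab's recipe:

```python
import numpy as np

def refine_T(Z, T, y_hat, p=0.5):
    """Sketch of the LF re-estimation step (our interpretation).
    Z: (N, L) binary LF-match matrix; T: (L, K) LFs-to-classes matrix;
    y_hat: (N,) confident out-of-sample labels, -1 = no confident label."""
    N, L = Z.shape
    K = T.shape[1]
    # Confident matrix: C[l, j] = #samples where LF l matched and y_hat == j
    C = np.zeros((L, K))
    for j in range(K):
        C[:, j] = Z[y_hat == j].sum(axis=0)
    # Calibrate: each LF row is rescaled to its total match count in Z,
    # then the whole matrix is normalized to a joint distribution.
    row = C.sum(axis=1, keepdims=True)
    row[row == 0] = 1.0                          # avoid division by zero
    Q = C / row * Z.sum(axis=0, keepdims=True).T
    Q = Q / Q.sum()
    # ASSUMPTION: put T on the same scale before the convex combination.
    T_norm = T / T.sum()
    return p * Q + (1 - p) * T_norm              # refined T-hat
```

Here `p` plays the role of the multiplying coefficient above: `p = 1` keeps only the re-estimated assignment, `p = 0` keeps only the original one.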
By multiplying Z and the newly estimated T̂ matrix, a new set of labels Y_upd is calculated. It can now be used either for rerunning the estimation process or for training the final classifier. After all iterations are done, the final classifier is trained on the training set X annotated with the improved labels Y. Note that, in contrast to Northcutt et al. (2021), we do not eliminate any training samples that are considered unreliable, but use them all with corrected labels.
Unlabeled instances. One of the challenges in weak supervision is the data samples where no LF matched. In some approaches, such samples are filtered out, while in others they are kept as belonging to a random class or to an "other" class. In ULF, such samples are initially assigned random labels and may be partly involved in cross-validation training. The proportion of randomly labeled data included in each training fold is defined by the hyperparameter λ: |{x_i | L_{x_i} = ∅}| = λ · |{x_j | L_{x_j} ≠ ∅}|. After each T̂ matrix recalculation, their new labels are calculated directly from the out-of-sample predicted probabilities: ŷ_i = argmax_j p̂(y = j; x_i, θ), where p̂(y = j; x_i, θ) ≥ t_j.

Denoising of Weakly Supervised Data using Cross-validation
In order to find suitable baselines for comparison, we discuss two frameworks, CrossWeigh (Wang et al., 2019b) and Cleanlab (Northcutt et al., 2021), which also use cross-validation to improve data annotations. Both of these methods were initially proposed for denoising manually annotated data. In order to provide an even fairer comparison with our method ULF, we introduce extensions of these frameworks for the weakly supervised setting: Weakly Supervised CrossWeigh (WSCW) and Weakly Supervised Cleanlab (WSCL). In contrast to the original methods, our extensions leverage the LF information, which leads to more accurate results.

Weakly Supervised CrossWeigh (WSCW)
CrossWeigh (CW, Wang et al. 2019b) was proposed for tracing inconsistent labels in crowdsourced annotations for the NER task. The algorithm traces consistently mislabeled entities by training multiple models in a cross-validation fashion on different data subsets and comparing the predictions on unseen data with the crowdsourced labels. This approach is promising for detecting unreliable LFs (which can be considered the samples' annotators) in weakly supervised data as well. If a potentially erroneous LF systematically annotates samples wrongly, a reliable model trained on data without it will not make this mistake in its predictions, and, thus, the error will be traced and reduced. Therefore, we propose Weakly Supervised CrossWeigh (WSCW), which splits the data with respect to the LFs. By excluding from the training folds all samples where the held-out LFs match, we refrain the model from making easy predictions and let it deduce the labels based on unseen LFs only.
More formally, we randomly split the LFs L into k folds {f_1, ..., f_k} and iteratively take the LFs from each fold f_i as test LFs, while the others serve as training LFs. All samples where no LF from the hold-out fold matches become training samples, while the rest are used for testing.
After that, we train k separate models on X_train_i. The labels ŷ predicted by the trained model for the samples in X_test_i are then compared to the initial noisy labels y. All samples x_j where ŷ_j ≠ y_j are flagged as potentially mislabeled; their influence is reduced in further training. The whole error-detection procedure is performed t times with different partitions to refine the results.
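The LF-based fold construction described above can be sketched as a generator over (train, test) index sets (the function name is our own):

```python
import numpy as np

def lf_fold_splits(Z, k=3, seed=0):
    """Sketch of the WSCW data split: partition the LFs into k folds;
    samples matching any held-out LF go to the test side, the rest to
    training. Z: (N, L) binary LF-match matrix."""
    N, L = Z.shape
    rng = np.random.default_rng(seed)
    lf_folds = np.array_split(rng.permutation(L), k)
    for held_out in lf_folds:
        is_test = Z[:, held_out].any(axis=1)   # a held-out LF matched here
        yield np.flatnonzero(~is_test), np.flatnonzero(is_test)
```

Note that, unlike a standard k-fold split, the resulting test sets may overlap (a sample matching LFs from several folds appears in several test sets), and samples with no LF matched always land on the training side.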

Weakly Supervised Cleanlab (WSCL)
The Cleanlab framework (Northcutt et al., 2021) finds erroneous labels by estimating the joint distribution between the noisy labels and out-of-sample labels calculated with k-fold cross-validation. In our extension, Weakly Supervised Cleanlab (WSCL), we follow a similar approach but adapt it to weak supervision the same way as in WSCW: the cross-validation sets X_train_i and X_test_i are built with respect to the matched LFs. These sets are used to train k models, which predict the class probabilities for the test samples. The labels ŷ are then calculated with respect to the expected class self-confidence values t_j (see Northcutt et al. 2021 for more details).
A sample x_i is considered to confidently belong to class j ∈ K if the probability of class j is greater than the expected self-confidence t_j for this class or, if several classes exceed their thresholds, is the maximal one among them. Samples with no probability exceeding the thresholds receive no decisive label and do not participate in further denoising.
After that, a class-to-class confident joint matrix C_{K×K} is calculated, where:

C_{i,j} = |{x ∈ X : y_x = k_i, ŷ_x = k_j}|.

Notably, C contains only information about the correspondence between noisy and out-of-sample predicted labels (the same way as in Northcutt et al. 2021). That is, it only counts the samples with presumably incorrect noisy labels y, but does not provide any insight into which LFs assigned these noisy labels and can thus be claimed erroneous (in contrast to the ULF approach, see Section 3).
The confident matrix C is then calibrated and normalized in order to obtain an estimate Q̂_{K×K} of the joint distribution between noisy and predicted labels, which determines the number of samples to be pruned. We perform pruning by noise rate following the Cleanlab default setting: for each pair i ≠ j, the n · Q̂_{i,j} samples with the largest margin p̂(y = j) − p̂(y = i) are eliminated from further training.
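The prune-by-noise-rate step can be sketched as follows, a minimal reimplementation of the Cleanlab-style pruning described above (the function name and interface are our own):

```python
import numpy as np

def prune_by_noise_rate(y_weak, P, Q, n):
    """Sketch of prune-by-noise-rate: for each off-diagonal cell (i, j),
    drop the n * Q[i, j] samples with noisy label i whose margin
    p(y=j) - p(y=i) is largest. Returns the indices of kept samples."""
    drop = set()
    K = Q.shape[0]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            m = int(round(n * Q[i, j]))           # how many to prune
            cand = np.flatnonzero(y_weak == i)    # samples noisily labeled i
            margins = P[cand, j] - P[cand, i]
            drop.update(cand[np.argsort(-margins)[:m]].tolist())
    return np.setdiff1d(np.arange(len(y_weak)), sorted(drop))
```

The samples most confidently predicted to belong to a class other than their noisy label are removed first, which mirrors the margin-based ranking above.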

Experiments
Datasets. We evaluate our methods on four well-known weakly supervised English datasets (the number of covered tasks and the language limitation of the datasets certainly leave room for future work): (1) the YouTube Spam Classification dataset (Alberto et al., 2015), also used in (Ratner et al., 2020); (2) the SMS Spam Classification dataset (Almeida et al., 2011); (3) the Question Classification dataset from TREC-6 (Hovy et al., 2001; Li and Roth, 2002); (4) the Spouse Relation Classification dataset based on the Signal Media One-Million News Articles Dataset (Corney et al., 2016), also used in (Ratner et al., 2020). For a fair comparison, the same LFs and evaluation metrics as in previous works were used. The data statistics are provided in Table 5.
All datasets used in the experiments have their own peculiarities. For example, in the Spouse dataset, 75% of samples are not covered by any LF, while in the remaining 25% there is a high ratio of LF overlaps.
Baselines. We compare our algorithm against two baselines: (1) Majority + Training: the classifier is trained on the data and noisy labels acquired with simple majority voting and randomly broken ties; (2) Snorkel-DP (Ratner et al., 2020): the classifier is trained on labels produced by the Snorkel data programming framework.
In order to evaluate the performance of our weakly supervised adaptations of CrossWeigh and Cleanlab, we also include the original frameworks as baselines: (3) CrossWeigh (Wang et al., 2019b); (4a) Cleanlab, the original implementation of Northcutt et al. (2021) with the Scikit-learn library; (4b) Cleanlab-PyTorch, our PyTorch-based reimplementation of the Cleanlab algorithm, included in order to provide a fair comparison with our methods implemented with the PyTorch library (Paszke et al., 2019). Following (Northcutt et al., 2021), we use a logistic regression model in our experiments.
Implementation Details. The training data were encoded with TF-IDF vectors. For training, we set the number of epochs to 20 and applied early stopping with patience = 5. We ran each experiment 10 times; the models that showed the best scores on the development set were evaluated on the test sets. The development set is also used to estimate the number of iterations I: initially, it is set to I = 20, but if the training labels do not change after three iterations, the algorithm stops. All experiments are implemented with the framework of Sedova et al. (2021), which, by providing access to all WS components, allowed us to implement and benchmark all algorithms described above (the code will be publicly released upon acceptance). All experiments were performed on a machine with a CPU frequency of 2.2GHz and 40 cores; the full setup took on average 20 hours per dataset.

Experimental Results
Table 1 reports the average results across all datasets together with the standard error of the mean. Due to the skewness of the Spouse and SMS datasets, we report the F1 score for them; for the other datasets, an accuracy score is provided.
Overall, our WSCW and WSCL extensions of the CrossWeigh and Cleanlab frameworks outperform the corresponding base methods in most cases and lead to consistent improvements over the Majority and Snorkel-DP baselines. This supports our hypothesis about the importance of LFs when applying cross-validation techniques to weak supervision. At the same time, the ULF algorithm shows the best result overall on most of the datasets. ULF_sgn outperforms the classifier trained on labels chosen by majority voting without any additional denoising by 6.8% on average and Snorkel-DP by 9.2%; ULF_rndm outperforms them by 4.4% and 6.7%, respectively. Interestingly, ULF_lfs demonstrates a worse result than ULF_sgn and even ULF_rndm, which could be explained by the multiple LF overlaps that make using the LFs for splitting the data within ULF a complicated task.

The analysis of hyperparameter values in Tables 3 and 4 shows that in most cases signature-based data splitting requires a smaller number of folds k and iterations I than random data splitting. In the meantime, the values of the multiplying coefficient p and the non-labeled data rate λ, being data- rather than algorithm-specific parameters, remain the same. Note that our method does not involve any manually annotated data and, therefore, cannot be directly compared to results reported for settings that include it (Karamanolakis et al., 2021; Awasthi et al., 2020).

Figure 4 shows several samples from the YouTube and SMS datasets with erroneous labels that were detected and changed by ULF_lfs. While Sample 1 is clearly a SPAM comment, no SPAM LF matched in it (since the potentially spam keyword "subscribe" is reduced there to "sub"). However, an LF short_comment corresponding to class HAM did match, which resulted in the wrong label HAM being assigned.
Sample 2 is a SPAM message not covered by any keyword-based LF: while the LFs are primarily concerned with cases where a user asks to subscribe to a channel (using the keywords "subscribe", "my"), here there is a request to like the comment. Thus, the lack of matching LFs leads to misclassification. In Sample 3, the word "subscribe" is indeed mentioned, but in a context related to the video itself, which makes this message not SPAM, in contrast to the label it was assigned. In the same way, the keyword "password", which is defined as an LF corresponding to the SPAM class in the SMS dataset (in order to detect spam messages like "Send me your id and password"), matched in a HAM message and annotated it with the SPAM label. Sample 5 is different: here, two LFs from different classes match; as a result, the label SPAM was assigned not by the LFs but randomly. All of these various misclassifications were corrected by ULF_lfs, and the corresponding samples were reassigned to the correct classes.

Conclusion
In our work, we explored k-fold cross-validation based methods for denoising weakly annotated data. Firstly, we introduced our extensions of two frameworks initially proposed for denoising manually annotated data: CrossWeigh (Wang et al., 2019b) and Cleanlab (Northcutt et al., 2021). We adapted them to the weakly supervised setting by leveraging information about which labeling functions match in the data samples, and demonstrated the effectiveness of our adaptations. In the second part of our work, we proposed the ULF algorithm for unsupervised labeling function denoising. By counting the confident labeling-functions-to-classes matrix and estimating their joint distribution, ULF refines the labeling-functions-to-classes allocation without any manually annotated data, allowing for correction (and not just removal) of the weak labels. Thus, the entire training set can be used for training a reliable classifier without any data reduction.