Semi-Supervised Data Programming with Subset Selection

The paradigms of data programming, which uses weak supervision in the form of rules/labelling functions, and semi-supervised learning, which augments small amounts of labelled data with a large unlabelled dataset, have shown great promise in several text classification scenarios. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performance, particularly when the labelling functions are noisy. The first contribution of this work is the introduction of \model, a semi-supervised data programming framework that learns a \emph{joint model} which effectively uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study \modelss, which additionally performs subset selection on top of the joint semi-supervised data programming objective and \emph{selects} a set of examples that can be used as the labelled set by \model. The goal of \modelss is to ensure that the labelled data \emph{complement} the labelling functions, thereby benefiting both from data programming and from appropriately selected data for human labelling. We demonstrate that by effectively combining the semi-supervision, data programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets.\footnote{The source code is available at \url{https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection}}


Introduction
Modern machine learning techniques rely on large amounts of labelled training data for text classification tasks such as spam detection, (movie) genre classification, sequence labelling, etc. Supervised learning approaches have utilised such large amounts of labelled data, and this has resulted in huge successes in the last decade. However, the acquisition of labelled data, in most cases, entails a painstaking process requiring months of human effort. Several techniques such as active learning, distant supervision, crowd-consensus learning, and semi-supervised learning have been proposed to reduce the annotation cost (Settles et al., 2008). However, clean annotated labels continue to be critical for reliable results (Bach et al., 2019; Goh et al., 2018). Recently, Ratner et al. (2016) proposed the data programming paradigm, in which several Labelling Functions (LFs) written by humans are used to weakly associate labels with instances. In data programming, users encode the weak supervision in the form of labelling functions. On the other hand, traditional semi-supervised learning methods combine a small amount of labelled data with large unlabelled data (Kingma et al., 2014). In this paper, we leverage semi-supervision in the feature space for more effective data programming using labelling functions.

Motivating Example
We illustrate LFs on one of the seven tasks on which we experiment, viz., identifying spam/non-spam comments in YouTube reviews. For some applications, writing LFs is often as simple as using keyword lookups or a regex expression. In this specific case, users construct heuristic patterns as LFs for classifying comments as spam/not-spam. Each LF takes a comment as input and provides a binary label as output: +1 indicates that the comment is spam, -1 indicates that it is not spam, and 0 indicates that the LF is unable to assert anything for the comment (referred to as an abstain). Table 1 presents a few example LFs for this task.
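For concreteness, the three LFs in Table 1 can be written as simple Python functions. This is a hypothetical sketch based only on the table's descriptions; the paper's actual implementation may differ:

```python
SPAM, NOT_SPAM, ABSTAIN = 1, -1, 0

def lf1_contains_link(comment: str) -> int:
    """LF1: return +1 (spam) if the comment contains a URL, else abstain."""
    return SPAM if ("http" in comment or "https" in comment) else ABSTAIN

def lf2_short_comment(comment: str) -> int:
    """LF2: return -1 (not spam) for comments shorter than 5 words, else abstain.
    (Non-spam comments are often short.)"""
    return NOT_SPAM if len(comment.split()) < 5 else ABSTAIN

def lf3_self_promotion(comment: str) -> int:
    """LF3: return +1 (spam) if the comment promotes the author's channel/video."""
    text = comment.lower()
    return SPAM if ("my channel" in text or "my video" in text) else ABSTAIN
```

Note that each LF is a weak, partial signal: it votes for one class when its pattern matches and abstains otherwise.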
In isolation, a particular LF may be neither always correct nor complete. Furthermore, the LFs may also produce conflicting labels. In the past, generative models such as Snorkel (Ratner et al., 2016) and CAGE (Chatterjee et al., 2020) have been proposed for consensus on the noisy and conflicting labels assigned by the discrete LFs to determine the probability of the correct labels. Labels thus obtained could be used for training any supervised model/classifier and evaluated on a test set.

Id  | Description
LF1 | If http or https occurs in the comment text, then return +1; otherwise ABSTAIN (return 0).
LF2 | If the length of the comment is less than 5 words, then return -1; otherwise ABSTAIN (return 0). (Non-spam comments are often short.)
LF3 | If the comment contains "my channel" or "my video", then return +1; otherwise ABSTAIN (return 0).

Table 1: Three LFs based on keyword lookups or regex expressions for the YouTube spam classification task.

We will next highlight a challenge in doing data programming using only LFs, which we attempt to address. For each of the following sentences S1...S6 that can constitute an observed set of training instances, we state the value of the true label (±1). While the candidates in S1 and S4 are instances of a spam comment, the ones in S2 and S3 are not. In fact, these examples constitute one of the canonical cases that we discovered during the analysis of our approach in Section 4.4.

1. S1, +1: Please help me go to college guys! Thanks from the bottom of my heart.

6. S6, +1: Watch Maroon 5's latest ... www.youtube.com/watch?v=TQ046FuAu00

In Table 2, we present the outputs of the LFs as well as some n-gram features F1 ('.com') and F2 ('This song') on the observed training examples S1, S2, S3 and S4, as well as on the unseen test examples S5 and S6. For S1, the correct consensus can easily be performed to output the true label +1, since LF1 (designed for class +1) is triggered, whereas LF2 (designed for class -1) is not. Similarly, for S2, LF2 is triggered whereas LF1 is not, making it possible to easily perform the correct consensus. Hence, we have treated S1 and S2 as unlabelled, indicating that we could learn a model based on LFs alone, without supervision, if all we observed were these two examples and the outputs of LF1 and LF2. However, the correct consensus on S3 and S4 is challenging, since LF1 and LF2 either both fire or both do not. While the (n-gram based) features F1 and F2 appear to be informative and could potentially complement LF1 and LF2, we can easily see that correlating feature values with LF outputs is tricky in a completely unsupervised setup. To address this issue, we ask the following questions: (A) What if we are provided access to the true labels of a small subset of instances (in this case, only S3 and S4)? Could (i) the correlation of feature values (e.g., F1 and F2) with labels (e.g., +1 and -1 respectively), modelled via a small set of labelled instances (e.g., S3 and S4), in conjunction with (ii) the correlation of feature values (e.g., F1 and F2) with LFs (e.g., LF1 and LF2), modelled via a potentially larger set of unlabelled instances (e.g., S1, S2), help improve the prediction of labels for hitherto unseen test instances S5 and S6? (B) Can we precisely determine the subset of the unlabelled data that, when labelled, would help us train a model (in conjunction with the labelling functions) that is most effective on the test set?
In other words, instead of randomly choosing the labelled dataset for semi-supervised learning (part A), can we intelligently select the labelled subset? In the above example, choosing {S3, S4} as the labelled set would be much more useful than choosing {S1, S2}. As a solution to (A), in Section 3.3, we present a new formulation, SPEAR (Semi-suPervisEd dAta pRogramming), in which the parameters over features and LFs are jointly trained in a semi-supervised manner. As for (B), we present a subset selection recipe, SPEAR-SS (Section 3.4), that recommends the subset of the data (e.g., S3 and S4) which, after labelling, would most benefit the joint learning framework.

Our Contributions
We summarise our main contributions as follows. To address (A), we present SPEAR (c.f., Section 3.3), a novel paradigm for jointly learning the parameters over features and labelling functions in a semi-supervised manner; we jointly train a parameterised graphical model and a classifier model on our overall objective. To address (B), we present SPEAR-SS (c.f., Section 3.4), a subset selection approach that selects the set of examples to be used as the labelled set by SPEAR. We show, in particular, that through a principled data selection approach, we can achieve significantly higher accuracies than by randomly selecting the seed labelled set for semi-supervised learning with labelling functions. Moreover, we also show that the automatically selected subset performs comparably to, or even better than, the subset hand-picked by humans in the work reported by Awasthi et al. (2020), further emphasising the benefit of subset selection for semi-supervised data programming. Our framework is agnostic to the underlying network architecture and can be applied using different underlying techniques without a change in the meta-approach. Finally, we evaluate our model on seven publicly available datasets from domains such as spam detection, record classification, and genre prediction, and demonstrate significant improvements over state-of-the-art techniques. We also draw insights from experiments in synthetic settings (presented in the appendix).

Related Work
Data Programming and Unsupervised Learning: Snorkel (Ratner et al., 2016) has been proposed as a generative model to determine correct label probabilities using consensus on the noisy and conflicting labels assigned by the discrete LFs. Chatterjee et al. (2020) proposed a graphical model, CAGE, that uses continuous-valued LFs with scores obtained using soft match techniques such as cosine similarity of word vectors, TF-IDF score, distance among entity pairs, etc. Owing to its generative model, Snorkel is highly sensitive to initialisation and hyper-parameters. On the other hand, the CAGE model employs user-controlled quality guides that incorporate labeller intuition into the model. However, both models completely disregard feature information that could provide additional signal for learning the (graphical) model, and they learn a combined model for the labelling functions in a purely unsupervised manner. In practical scenarios, some labelled data is usually available (or could be made available by labelling a few instances); hence, a completely unsupervised approach might not be the best solution. In this work, we augment these data programming approaches by designing a semi-supervised model that incorporates feature information and LFs to learn the parameters jointly. Hu et al. (2016) proposed a student-teacher model that transfers rule information by assigning a linear weight to each rule based on an agreement objective. The model we propose in this paper jointly learns parameters over features and rules in a semi-supervised manner rather than just weighing their outputs, and can therefore be more expressive.

Semi-Supervised Data Programming: The only work which, to our knowledge, combines rules with supervised learning in a joint framework is that of Awasthi et al. (2020). They leverage both rules and labelled data by associating each rule with exemplars of correct firings (i.e., instantiations) of that rule.
Their joint training algorithms denoise over-generalised rules and train a classification model. Our approach differs from theirs in two ways: (a) we do not have information about rule exemplars, so our labelled examples need not have any correspondence to any of the LFs (and may instead complement the LFs, as illustrated in Table 2); and (b) we employ a semi-supervised framework combined with a graphical model for consensus amongst the LFs to train our model. We also study how to automatically select the seed set of labelled data, rather than having a human provide this seed set, as was done in Awasthi et al. (2020).
Data Subset Selection: Finally, another approach that has been gaining a lot of attention recently is data subset selection. The specific application of data subset selection depends on the goal at hand. Data subset selection techniques have been used to reduce end-to-end training time (Mirzasoleiman et al., 2019; Kaushal et al., 2019; Killamsetty et al., 2021), to select unlabelled points to label in an active learning manner (Wei et al., 2015; Sener and Savarese, 2017), and for topic summarisation (Bairi et al., 2015). In this paper, we present a framework (SPEAR-SS) of data subset selection for selecting a subset of unlabelled examples for obtaining labels complementary to the labelling functions.

Problem Description
Let X and Y ∈ {1...K} be the feature space and the label space, respectively. We also have access to m labelling functions (LFs) λ_1 to λ_m. As mentioned in Section 1.1, each LF λ_j is designed to record some class; let us denote by k_j ∈ {1...K} the class associated with λ_j. The dataset consists of two components, viz., (i) a labelled set L, consisting of instances with their features, LF outputs and true labels, and (ii) an unlabelled set U, consisting of instances with their features and LF outputs but no true labels. Here, the vector l_i = (l_i1, l_i2, ..., l_im) denotes the firings of all the LFs on instance x_i. Each l_ij can be either 1 or 0; l_ij = 1 indicates that the LF λ_j has fired on instance i, and 0 indicates that it has not. All the labelling functions are discrete; hence, no continuous scores are associated with them.
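For illustration, the firing vector l_i simply records which LFs produced a non-abstain output on x_i. A toy sketch (the two LFs below are hypothetical, with 0 encoding abstain):

```python
def firing_vector(x, lfs):
    """l_i for instance x: l_ij = 1 if LF j fired (returned a non-abstain
    label) on x, and 0 otherwise."""
    return [0 if lf(x) == 0 else 1 for lf in lfs]

# Two toy LFs for a binary task (outputs: +1, -1, or 0 = abstain).
lf_pos = lambda text: 1 if "subscribe" in text else 0
lf_neg = lambda text: -1 if len(text.split()) < 3 else 0

print(firing_vector("please subscribe to me", [lf_pos, lf_neg]))
```

Note that l_ij records only whether λ_j fired; the class it votes for is carried separately by the association k_j.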

Classification and Labelling Function Models
SPEAR has a feature-based classification model f_φ(x), which takes the features as input and predicts the class label. The output of this model is P_fφ(y|x), i.e., the probability of the classes given the input features. This can be a simple classification model such as logistic regression or a simple neural network; we consider both in this paper.
We also use an LF-based graphical model P_θ(l_i, y) which, as specified in equation (1) for an example x_i, is a generative model on the LF outputs and the class label y.²

² We use the association of LF λ_j with some class k_j only in the quality guide component (QG) of the loss in eqn. (3).

P_fφ   | The label probabilities as per the feature-based model f_φ
P_θ    | The label probabilities as per the LF-based graphical model
L_CE   | Cross-entropy loss
g(l_i) | Label prediction from the LF-based graphical model
LL_s   | Supervised negative log likelihood
LL_u   | Unsupervised negative log likelihood summed over labels
KL     | KL divergence between two probability models
R      | Quality guide based loss

Table 3: Summary of notation.

There are K parameters θ_j1, θ_j2, ..., θ_jK for each LF λ_j, where K is the number of classes. The model makes the simple assumption that each LF λ_j acts independently on an instance x_i to produce the outputs l_i1, l_i2, ..., l_im:

    P_θ(l_i, y) = (1/Z_θ) ∏_{j=1}^{m} ψ_θ(l_ij, y)        (1)

The potentials ψ_θ invoked in equation (1) are defined in equation (2):

    ψ_θ(l_ij, y) = exp(θ_jy) if l_ij = 1, and 1 otherwise        (2)

Here, Z_θ is the normalisation factor. We propose a joint learning algorithm with semi-supervision that employs both features and LF predictions in an end-to-end manner.
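A minimal numpy sketch of this generative model under the stated independence assumption. The potentials follow the CAGE-style form above; the θ values are toy numbers, and the paper's exact parameterisation may differ:

```python
import numpy as np

def joint_prob(theta, l):
    """P_theta(l, y) for each class y: product over LFs of the potentials
    psi(l_ij, y) = exp(theta[j, y]) if LF j fired (l_j = 1), and 1 otherwise,
    normalised by Z_theta (the sum of the potentials over all (l', y))."""
    m, K = theta.shape
    unnorm = np.array([np.prod(np.where(l == 1, np.exp(theta[:, y]), 1.0))
                       for y in range(K)])
    Z = sum(np.prod(1.0 + np.exp(theta[:, y])) for y in range(K))
    return unnorm / Z

def g(theta, l):
    """Predicted label g(l) = argmax_y P_theta(l, y)."""
    return int(np.argmax(joint_prob(theta, l)))

# Two LFs, two classes: LF 0 is associated with class 0, LF 1 with class 1.
theta = np.array([[2.0, -1.0],
                  [-1.0, 2.0]])
print(g(theta, np.array([1, 0])))  # only LF 0 fires
```

Because the LFs factorise independently, Z_θ decomposes into a per-class product over (1 + exp(θ_jy)), keeping normalisation cheap.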

Joint Learning in SPEAR
We first specify the objective of SPEAR, which jointly minimises the loss components L1 through L6 described below (together with the quality guide QG) over the parameters (θ, φ), and thereafter explain each of its components in greater detail. Before we proceed further, we refer the reader to Table 3, in which we summarise the notation built so far as well as the notation that we will soon be introducing. First Component (L1): The first component L1 is the standard cross-entropy loss on the labelled dataset L for the model P_fφ. Second Component (L2): The second component L2 is the semi-supervised loss on the unlabelled data U. In our framework, we can use any unsupervised loss function; for this paper, we use the entropy minimisation approach (Grandvalet and Bengio, 2005). Thus, our second component is H(P_fφ(y|x_i)), the entropy of the predictions on the unlabelled dataset. It acts as a form of semi-supervision by trying to increase the confidence of the predictions made by the model on the unlabelled dataset.
Third Component (L3): The third component is the cross-entropy of the classification model using the hypothesised labels from CAGE (Chatterjee et al., 2020) on U. Given that l_i is the output vector of all labelling functions for any x_i ∈ U, we specify the predicted label for x_i using the LF-based graphical model P_θ(l_i, y) from eqn. (1) as: g(l_i) = argmax_y P_θ(l_i, y). Fourth Component (L4): The fourth component LL_s(θ|L) is the (supervised) negative log likelihood loss on the labelled dataset L as per eqn. (3). Fifth Component (L5): The fifth component LL_u(θ|U) is the negative log likelihood loss for the unlabelled dataset U as per eqn. (3); since the true label information is not available, the probabilities need to be summed over y. Sixth Component (L6): The sixth component is the Kullback-Leibler (KL) divergence between the predictions of the two models, viz., the feature-based model f_φ and the LF-based graphical model P_θ, summed over every example x_i ∈ U ∪ L. Through this term, we try to make the models agree in their predictions over the union of the labelled and unlabelled datasets. Quality Guides (QG): As the last component in our objective, we use quality guides R(θ|{q_j}) on LFs, which have been shown (Chatterjee et al., 2020) to stabilise the unsupervised likelihood training while using labelling functions. Let q_j be the fraction of cases where λ_j correctly triggered, and let q_j^t be the user's belief on the fraction of examples x_i where y_i and l_ij agree. When the user's beliefs are not available, we take the precision of the LFs on the validation set as the user's beliefs; we do this for all datasets except SMS. If P_θ(y_i = k_j | l_ij = 1) is the model-based precision over the LFs, the quality guide based loss can be expressed as R(θ|{q_j^t}) = Σ_j q_j^t log P_θ(y_i = k_j | l_ij = 1) + (1 − q_j^t) log(1 − P_θ(y_i = k_j | l_ij = 1)). Throughout the paper, we consider QG always in conjunction with loss L5.
In summary, the first three components (L1, L2 and L3) invoke losses on the supervised model f φ . While L1 compares the output f φ against the ground truth in the labelled set L, L2 and L3 operate on the unlabelled data U by minimizing the entropy of f φ (L2) and by calibrating the f φ output against the noisy predictions g(l i ) of the graphical model P θ (l i , y) for each x i ∈ U (L3). The next two components L4 and L5 focus on maximizing the likelihood of the parameters θ of P θ (l i , y) over labelled x i ∈ L and unlabelled x i ∈ U datasets respectively. Finally, in L6, we compare the probabilistic outputs from the supervised model f φ against those from the graphical model P θ (l, y) through a KL divergence based loss. We use the ADAM (stochastic gradient descent) optimizer to train the non-convex loss objective.
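To make the L2 and L6 terms concrete, below is a small numpy illustration of the entropy and KL quantities on single predicted distributions. This is a sketch only; the actual objective aggregates these terms over U and U ∪ L and is optimised with ADAM:

```python
import numpy as np

def entropy(p):
    """L2-style term: entropy H(p) of the classifier's predicted distribution."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))

def kl_div(p, q):
    """L6-style term: KL(p || q) between the classifier's and the graphical
    model's predicted distributions for one example."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

# Minimising entropy pushes the classifier toward confident predictions on U;
# minimising KL pushes the two models toward agreement on U and L.
print(entropy([0.95, 0.05]), entropy([0.5, 0.5]))
print(kl_div([0.9, 0.1], [0.6, 0.4]))
```

A confident prediction has lower entropy than a uniform one, and the KL term is zero exactly when the two models' outputs coincide.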
Previous data programming approaches (Bach et al., 2019; Chatterjee et al., 2020) adopt a cascaded approach in which they first optimise a variant of L5 to learn the θ parameters associated with the LFs, and thereafter use the noisily generated labels from g(l) to learn the supervised model f_φ using a variant of L3. In contrast, our approach learns the LFs' θ parameters and the model's φ parameters jointly in the context of the unlabelled data U.
We present synthetic experiments to illustrate the effect of SPEAR for data programming and semi-supervision in a controlled setting in which (i) the overlap between classes in the data is controlled and (ii) the labelling functions are accurate. The details of the synthetic experiments are provided in the appendix.

SPEAR-SS: Subset Selection with SPEAR
Suppose we are given an unlabelled dataset U and a limited labelling budget owing to the costs involved. It is essential that we choose carefully which data points to label. We explore two strategies for selecting a subset of data points from the unlabelled set; we then obtain labels for this subset and run SPEAR on the combination of this labelled set and the remaining unlabelled set. Both approaches are intended to maximise the diversity of the selected subset in the feature space, and we complement both with entropy filtering (also described below).
Unsupervised Facility Location: In this approach, given an unlabelled dataset U, we want to select a subset S such that the selected subset has maximum diversity with respect to the features. Inherently, we are trying to maximise the information gained by a machine learning model when trained on the selected subset. The objective function for unsupervised facility location is f_unsup(S) = Σ_{i∈U} max_{j∈S} σ_ij, where σ_ij denotes the similarity score (in the feature space X) between a data instance x_i in the unlabelled set U and a data instance x_j in the selected subset S. We employ a lazy greedy strategy to select the subset. In conjunction with the entropy filtering described below, we call this technique Unsupervised Subset Selection.

Supervised Facility Location: The objective function for supervised facility location (Wei et al., 2015) is f_sup(S) = Σ_{y∈Y} Σ_{i∈U_y} max_{j∈S∩U_y} σ_ij. Here we assume that U_y ⊆ U is the subset of data points with hypothesised label y. Simply put, the sets U_y form a partition of U based on the hypothesised labels obtained by performing unsupervised learning with the labelling functions. In conjunction with entropy filtering, we call this technique Supervised Subset Selection.

Entropy Filtering: We also perform a filtering based on entropy. In particular, we sort the examples by decreasing entropy and select the top f·B data points,³ where B is the data selection budget (which was set to the size of the labelled set |L| in all our experiments). On the filtered dataset, we perform subset selection using either the supervised or the unsupervised facility location objective described above. Below, we describe the optimisation algorithm for subset selection.

Optimisation Algorithms and Submodularity: Both f_unsup(S) and f_sup(S) are submodular functions. We select a subset S of the filtered unlabelled data by maximising these functions under a cardinality budget k (i.e., a labelling budget). For cardinality-constrained maximisation, a simple greedy algorithm provides a near-optimal solution (Nemhauser et al., 1978). Starting with S_0 = ∅, we sequentially update

    S_{t+1} = S_t ∪ argmax_{j ∈ U \ S_t} f(j | S_t)        (4)

where f(j|S) = f(S ∪ {j}) − f(S) is the gain of adding element j to the set S. We iteratively execute the greedy step (4) until t = k and |S_t| = k. It is easy to see that the complexity of the greedy algorithm is O(nkT_f), where T_f is the complexity of evaluating the gain f(j|S) for the supervised and unsupervised facility location functions. We then significantly speed up this simple greedy algorithm via a lazy greedy algorithm (Minoux, 1978).

³ In our experiments, we set f = 5.
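The greedy step (4) for the unsupervised facility-location objective can be sketched as follows. For clarity this is the plain greedy variant rather than the lazy one, and the RBF similarity and toy points are illustrative assumptions:

```python
import numpy as np

def facility_location_gain(sim, S, j):
    """Gain f(j | S) for f_unsup(S) = sum_i max_{k in S} sim[i, k]."""
    current = sim[:, S].max(axis=1) if S else np.zeros(sim.shape[0])
    return float(np.maximum(current, sim[:, j]).sum() - current.sum())

def greedy_facility_location(sim, budget):
    """Select `budget` points by greedily maximising the facility-location
    function; each step adds the element with the largest marginal gain."""
    n = sim.shape[0]
    S = []
    for _ in range(budget):
        gains = [facility_location_gain(sim, S, j) if j not in S else -np.inf
                 for j in range(n)]
        S.append(int(np.argmax(gains)))
    return S

# Toy run: two tight clusters; greedy should pick one representative of each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
sim = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF similarity
print(greedy_facility_location(sim, budget=2))
```

Because facility location is submodular, once one cluster is covered the marginal gain of a second point from the same cluster collapses, so the greedy step naturally spreads the budget across diverse regions of the feature space.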

Experiments
In this section, we (1) evaluate our joint learning against state-of-the-art approaches and (2) demonstrate the importance of subset selection over random subset selection. We present evaluations on seven datasets on tasks such as text classification, record classification and sequence labelling.

Datasets
We adopt the same experimental setting as Awasthi et al. (2020) for the dataset splits and the labelling functions. However, for the sake of fairness, we set the validation set size equal to the size of the labelled set, unlike Awasthi et al. (2020), in which the validation set was assumed to be much larger. We use the following datasets: (1) YouTube, a spam classification task on YouTube comments; (2) SMS Spam Classification (Almeida et al., 2011), a binary spam classification dataset containing 5574 documents; (3) MIT-R (Liu et al., 2013), a sequence labelling task over tokens with the following labels: Amenity, Prices, Cuisine, Dish, Location, Hours, Others; (4) IMDB, a plot-summary based binary movie genre classification dataset, with the LFs (and the labelled set) obtained following the approach of Varma and Ré (2018); (5) Census (Dua and Graff, 2017), (6) Ionosphere, and (7) Audit, which are all UCI datasets. The task in Census is to predict whether a person earns more than $50K or not. Ionosphere is a radar binary classification task given a list of 32 features. The task in Audit is to classify suspicious firms based on present and historical risk factors.
Statistics pertaining to these datasets are presented in Table 4. Since we compare performances against models that adopt different terminology, we refer to rules and labelling functions interchangeably. For fairness, we restrict the size of the validation set and keep it equal to the size |L| of the labelled set. For all experiments involving comparison with previous approaches, we used the code and hyperparameters from Awasthi et al. (2020), but with our smaller-sized validation set. Note that we mostly outperform them even with their larger-sized validation set, as can be seen in Table 5. More details on training and validation set sizes are given in the appendix.

Baselines
In Table 5, we compare SPEAR and SPEAR-SS against the following standard methods on seven datasets. Only-L: We train the classifier P_fφ(y|x) only on the labelled data L using the loss component L1. As explained earlier, following Awasthi et al. (2020), we observe that a 2-layered neural network trained with the small amount of labelled data is capable of achieving competitive accuracy. We choose this method as a baseline and report gains over it. L+U_maj: We train the baseline classifier on the labelled data L along with U_maj, where labels on the U instances are obtained by majority voting on the rules/LFs; in the training loss, the instances labelled by rules are weighted differently from the instances in L. Posterior Regularization (Hu et al., 2016): This is a method for joint learning of a rule network and a feature network in a teacher-student setup. Imply Loss (Awasthi et al., 2020): This approach uses additional information in the form of labelled rule exemplars and trains with a denoised rule-label loss. Since it uses information in addition to what we assume, Imply Loss can be considered as a skyline for our proposed approaches.
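The majority vote used to build U_maj can be sketched as follows (a hypothetical encoding in which each LF emits +1/-1/0, with 0 denoting abstain; the baseline's exact tie handling may differ):

```python
from collections import Counter

def majority_vote(lf_outputs):
    """Return the majority label over the non-abstain LF outputs, or 0 (no
    label assigned) when all LFs abstain or the top two labels tie."""
    votes = [o for o in lf_outputs if o != 0]
    if not votes:
        return 0
    (top, n), *rest = Counter(votes).most_common()
    if rest and rest[0][1] == n:   # tie between the two most frequent labels
        return 0
    return top

print(majority_vote([1, 1, -1, 0]))   # spam wins 2-1
print(majority_vote([1, -1, 0]))      # tie: no label assigned
```

Instances for which the vote yields no label are simply left out of U_maj.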

Results with SPEAR
SPEAR uses the 'best' combination of the loss components L1-L6. To determine the 'best' combination, we perform a grid search over various combinations of losses, using validation accuracy/F1-score as the criterion for selecting the most appropriate loss combination. Imply Loss uses a larger-sized validation set to tune its models; in our experiments, we maintained a validation set size equal to the size of the labelled data. In Table 5, we observe that SPEAR performs significantly better than all other approaches on all but the MIT-R dataset. Please note that all results are based on the same hand-picked labelled data subset as was chosen in prior work (Awasthi et al., 2020; Varma and Ré, 2018), except for Audit and Ionosphere. Even though we do not have rule-exemplar information in our model, SPEAR achieves better gains than even Imply Loss. Recall that Imply Loss can be viewed as a skyline approach owing to the additional exemplar information that associates labelling functions with specific labelled examples. The slightly lower performance of the 'best' SPEAR on the MIT-R dataset can be partially explained by the fact that there are no LFs corresponding to the '0' class label, owing to which our graphical model is not trained for all classes. However, as we will show in the next section, by suitably determining a subset of the dataset that can be labelled (using the facility location representation function), we achieve improved performance even on MIT-R (see Table 5). Also, note that in Table 5, we present results on two versions of Audit: one in which both the train and test sets are balanced, and another where the labelled training set is imbalanced. In the imbalanced case (where the number of positives is only 10%), we were unable to successfully run the Imply Loss and Posterior-Reg models (hence the '-' entries), despite communication with the authors.
We see that SPEAR and, similarly, SPEAR-SS (discussed below) significantly outperform the baselines by almost 40% in the imbalanced case. In the balanced case, the gains are similar to what we observe on the other datasets.

Table 5: Performance of SPEAR and SPEAR-SS for three subset selection schemes on seven datasets. All numbers reported are gains over the baseline method (Only-L). All results are averaged over 5 runs. Numbers in brackets '()' represent the standard deviation of the original score. Hand-picked instances refers to instances selected from the dataset for designing LFs. These instances are taken directly from Awasthi et al. (2020) to ensure a fair comparison.

Results with SPEAR-SS
Recall that all results discussed so far (including those for SPEAR) on the YouTube, SMS, MIT-R, IMDB and Census datasets were based on the same 'hand-picked' labelled data subset as in prior work (Awasthi et al., 2020; Varma and Ré, 2018).
In the case of Audit and Ionosphere, the labelled subset was randomly picked. In Table 5, we summarise the results obtained by employing the supervised and unsupervised subset selection schemes for picking the labelled set, and present comparisons against results obtained using (i) 'hand-picked' labelled sets and (ii) random selection of the labelled set. In each case, the size of the subset is the same, which we set to the size of the hand-picked labelled set. Our data selection schemes are applied to the 'best' SPEAR model obtained across the various loss components. We observe that the best-performing models for supervised and unsupervised data selection tend to outperform the best model based on random selection. Secondly, between the supervised and unsupervised data selection approaches, the supervised one tends to perform best, which means that using the hypothesised labels does help. Thirdly, we observe that on YouTube, MIT-R, IMDB and Audit, models using the selected subset outperform prior work that employs a hand-picked set, whereas on SMS, Census and Ionosphere, we come close. Finally, our approach is more stable than the others: the standard deviation of SPEAR is low across 5 different runs on all the datasets. As an illustration, the examples S3 and S4 referred to in Section 1.1 were precisely obtained through supervised subset selection in SPEAR-SS, to form part of the labelled dataset. As previously observed in Table 2, when included in the labelled set, S3 and S4 complement (via n-gram features such as F1 and F2) the effect of the labelling functions LF1 and LF2 on unlabelled examples such as S1 and S2. Further detailed results with subset selection can be found in the appendix.
In general, we observe that when the subset of instances selected for labelling is complementary to the labelling functions (as in our case), the performance is higher than when the labelled examples (exemplars) are inspired by labelling functions themselves as done in the work by Awasthi et al. (2020).

Significance Test
We employ the Wilcoxon signed-rank test (Wilcoxon, 1992) to determine whether there is a significant difference between SPEAR and Imply Loss (the current state-of-the-art). Our null hypothesis is that there is no significant difference between SPEAR and Imply Loss. For n = 7 instances, we observe that the one-tailed hypothesis is significant at p < 0.05, so we reject the null hypothesis. Clearly, SPEAR significantly outperforms Imply Loss and, therefore, all previous baselines.
Similarly, we perform the significance test to assess the difference between SPEAR-SS and Imply Loss. As expected, the one-tailed hypothesis is significant at p < 0.05, which implies that our SPEAR-SS approach significantly outperforms Imply Loss, and thus all other approaches.
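Such a paired comparison can be reproduced with scipy. The per-dataset scores below are hypothetical placeholders, not the paper's actual numbers; only the test procedure is illustrated:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-dataset scores, one pair per dataset (n = 7).
spear      = [94.0, 91.2, 75.6, 77.1, 82.4, 90.3, 88.9]
imply_loss = [93.1, 90.0, 74.0, 75.9, 81.0, 89.1, 87.2]

# One-tailed test: is SPEAR's score distribution shifted above Imply Loss's?
stat, p = wilcoxon(spear, imply_loss, alternative="greater")
print(f"W = {stat}, one-tailed p = {p:.4f}")
```

With n = 7 and every paired difference in the same direction, the exact one-tailed p-value is 1/2^7 ≈ 0.0078, below the 0.05 threshold.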

Conclusion
Table 6: Comparison of SPEAR and SPEAR-SS against Imply Loss on the subset of datasets from Table 5 for which Imply Loss used a much larger validation set than |L|. JL uses a validation set size equal to |L|.

We study how data programming can benefit from labelled data by learning a model (SPEAR) that
jointly optimises the consensus obtained from labelling functions in an unsupervised manner, along with semi-supervised loss functions designed in the feature space. We empirically assess the performance of the different components of our joint loss function. As another contribution, we also study some subset selection approaches to guide the selection of the labelled subset of examples. We present the performance of our models and present insights on both synthetic and real datasets. While outperforming previous approaches, our approach is often better than an exemplar-based (skyline) approach that uses the additional information of the association of rules with specific labelled examples.

A Illustration of SPEAR in a synthetic setting

Through a synthetic example, we illustrate the effectiveness of our formulation of combining semi-supervised learning with labelling functions (i.e., combined losses 1-6) to achieve superior performance. Consider a 3-class classification problem with overlap in the feature space, as depicted in Figure 1 (synthetic data). The classes are A, B and C. Though we illustrate the synthetic setting in 2 dimensions, in reality we performed similar experiments in three dimensions (and the results were similar). We randomly pick 5 points from each class i ∈ {a, b, c}, and corresponding to each such point (x_i, y_i) we create a labelling function based on its coordinates:

• LF_a: Consider the point (x_a, y_a). The corresponding LF will be: if y ≥ y_a, return 1 (i.e., classify as class A); else return 0 (abstain).
• LF_b: Similarly, for (x_b, y_b), the LF will return 1 if x ≤ x_b, and 0 otherwise.
• LF_c: The LF corresponding to (x_c, y_c) will return 1 if x ≥ x_c, and 0 otherwise.
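The three labelling-function templates above can be sketched as follows (a minimal illustration; the factory-function names and the anchor coordinates are our own, and each LF returns 1 to fire on its own class and 0 to abstain, as described above):

```python
def make_lf_a(y_a):
    """LF anchored at a class-A point: fire (return 1) when y >= y_a, else abstain (0)."""
    return lambda x, y: 1 if y >= y_a else 0

def make_lf_b(x_b):
    """LF anchored at a class-B point: fire when x <= x_b, else abstain."""
    return lambda x, y: 1 if x <= x_b else 0

def make_lf_c(x_c):
    """LF anchored at a class-C point: fire when x >= x_c, else abstain."""
    return lambda x, y: 1 if x >= x_c else 0

# Example: an LF anchored at a hypothetical class-A point with y_a = 1.5
lf = make_lf_a(1.5)
```

Picking 5 anchor points per class yields the 15 weak labelling functions used in this experiment.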
These 15 seemingly weak labelling functions (5 for each class) actually aid classification when the labelled example set is extremely small and the classifier is unable to get a good estimate of the class boundaries. This can be observed in Table 7, wherein we report the F1 score on a held-out test dataset for models obtained by training on the different loss components. The results are reported for the three-dimensional case, wherein each class was sampled from a 3-dimensional Gaussian. The means of the three classes A, B and C were (0, 0, 0), (0, 1, 0) and (0, 0, 1) respectively, and the variance of each class was set to (1, 1, 1). We make the following important observations with respect to Table 7: (1) Skyline: when the entire training data is treated as labelled and loss function L1 is minimised, we obtain a skyline model with an F1 score of 0.584. (2) With just 1% labelled data on L1, we achieve an F1 score of 0.349 (using only the labelled data). (3) We obtain an F1 score of 0.28 using the labelling functions on the unlabelled data (for L5) in conjunction with the 1% labelled data (for L4). (4) When the 1% labelled data (for L1) and the remaining observed unlabelled data (for L2) are used to train the semi-supervised model using L1+L2, an F1 score of 0.352 is obtained. (5) However, by jointly learning on all the loss components, we observe an F1 score of 0.44. This is far better than the numbers obtained using only (semi-)supervised learning and those obtained using only the labelling functions. Understandably, this number is lower than the skyline of 0.584 mentioned in the first row of Table 7.
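The three-dimensional synthetic setup above (isotropic Gaussians with the stated means and unit variance) can be sketched as follows; the sample size per class and the random seed are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Class means as stated above; variance is 1 in each dimension.
means = {"A": (0, 0, 0), "B": (0, 1, 0), "C": (0, 0, 1)}

# Draw 500 points per class (illustrative sample size).
X, y = [], []
for label, mu in means.items():
    X.append(rng.normal(loc=mu, scale=1.0, size=(500, 3)))
    y += [label] * 500
X = np.vstack(X)
```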

B Network Architecture
To train our model on the supervised data L, we use a neural network architecture having two hidden layers with ReLU activation. We chose our classification network to be the same as in (Awasthi et al., 2020). In the case of MIT-R and SMS, the classification network contains 512 units in each hidden layer, whereas the classification network for Census has 256 units in its hidden layers. For the YouTube dataset, we used simple logistic regression as the classifier network, again following (Awasthi et al., 2020). The features as well as the labelling functions for each dataset are also directly obtained from Snorkel (Ratner et al., 2016) and (Awasthi et al., 2020). Please note that all experiments (barring those on subset selection) are based on the same hand-picked labelled data subset as was chosen in (Awasthi et al., 2020).
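The two-hidden-layer ReLU classifier can be sketched as a minimal NumPy forward pass. The 512-unit hidden width matches the MIT-R/SMS setting above; the input dimension, output dimension, and initialisation here are placeholders of our own:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def init_mlp(d_in, d_hidden, n_classes, seed=0):
    """Random weights for a 2-hidden-layer MLP (placeholder initialisation)."""
    rng = np.random.default_rng(seed)
    sizes = [d_in, d_hidden, d_hidden, n_classes]
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU on both hidden layers, linear logits at the output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:  # no activation on the final (logit) layer
            x = relu(x)
    return x

# Placeholder dimensions: 100 input features, 3 output classes.
params = init_mlp(d_in=100, d_hidden=512, n_classes=3)
logits = forward(params, np.zeros((4, 100)))  # batch of 4 inputs
```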
In each experiment, we train our model for 100 epochs, and early stopping is performed based on the validation set. We use the Adam optimiser with the dropout probability set to 0.8. The learning rates for the f and g networks are set to 0.0003 and 0.001 respectively for the YouTube, Census and MIT-R datasets. For the SMS dataset, the learning rates are set to 0.0001 and 0.01 for the f and g networks respectively. For the Ionosphere dataset, the learning rate for f is set to 0.003. For each experiment, the numbers are obtained by averaging over five runs, each with a different random initialisation. The model with the best performance on the validation set is chosen for evaluation on the test set. As mentioned previously, the experimental setup in (Awasthi et al., 2020) surprisingly employed a large validation set. For fairness, we restrict the size of the validation set and keep it equal to the size of the labelled set. For all experiments involving comparison with previous approaches, we used the code and hyperparameters from (Awasthi et al., 2020), but with our smaller-sized validation set.
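The per-dataset hyperparameters above can be collected into a small configuration table (a sketch; the key names are our own, and the g-network learning rate for Ionosphere is not stated above, so it is omitted):

```python
# Learning rates for the feature-based classifier network f and the
# LF-aggregation network g, as listed above.
CONFIG = {
    "YouTube":    {"lr_f": 0.0003, "lr_g": 0.001},
    "Census":     {"lr_f": 0.0003, "lr_g": 0.001},
    "MIT-R":      {"lr_f": 0.0003, "lr_g": 0.001},
    "SMS":        {"lr_f": 0.0001, "lr_g": 0.01},
    "Ionosphere": {"lr_f": 0.003},  # lr_g not specified above
}

# Settings shared across datasets.
SHARED = {"epochs": 100, "dropout": 0.8, "optimizer": "Adam", "runs": 5}
```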
Following (Awasthi et al., 2020), we used binary F1 as the evaluation measure for the SMS dataset, macro F1 for the MIT-R dataset, and accuracy for the YouTube and Census datasets.

C Optimisation Algorithms and Submodularity: Lazy Greedy and Memoization
Both f_unsup(X) and f_sup(X) are submodular functions, and for data selection, we select a subset X of the unlabelled data that maximises these functions under a cardinality budget (i.e., a labelling budget). For cardinality-constrained maximisation, a simple greedy algorithm provides a near-optimal solution (Nemhauser et al., 1978).

Table 9: Performance on the test data of various loss combinations from our objective function in equation (3). For each dataset, the numbers in bold refer to the 'best' performing combination, determined based on performance on the validation set. In general, we observe that all the loss components (barring L2) contribute to the best model. Note that all combinations include QG (Component 7).
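The simple greedy procedure can be sketched for a facility-location objective f(X) = Σ_i max_{j∈X} S[i, j] (a minimal, naive implementation of our own; the similarity matrix S is a placeholder, and each gain is recomputed from scratch rather than memoised):

```python
import numpy as np

def facility_location(S, X):
    """f(X) = sum over all points of their max similarity to the selected set X."""
    if not X:
        return 0.0
    return S[:, list(X)].max(axis=1).sum()

def greedy_select(S, k):
    """Simple greedy: at each step, add the element with the largest marginal gain."""
    n = S.shape[0]
    X = []
    for _ in range(k):
        cands = [j for j in range(n) if j not in X]
        gains = [facility_location(S, X + [j]) - facility_location(S, X)
                 for j in cands]
        X.append(cands[int(np.argmax(gains))])
    return X
```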
Starting with X^0 = ∅, we sequentially update X^{t+1} = X^t ∪ argmax_{j ∈ V \ X^t} f(j|X^t), where f(j|X) = f(X ∪ {j}) − f(X) is the gain of adding element j to set X. We run this until t = k and |X^t| = k, where k is the budget constraint. It is easy to see that the complexity of the greedy algorithm is O(n k T_f), where T_f is the complexity of evaluating the gain f(j|X) for the supervised and unsupervised facility location functions. This simple greedy algorithm can be significantly optimised via the lazy greedy algorithm (Minoux, 1978). The idea is that instead of recomputing f(j|X^t) for all j ∉ X^t, we maintain a priority queue of sorted gains ρ(j), ∀j ∈ V. Initially, ρ(j) is set to f(j), ∀j ∈ V. The algorithm extracts the element j ∉ X^t with the largest ρ(j) and recomputes its gain f(j|X^t); if this recomputed gain is still at least the stored gain ρ(j') of the next element j' in the queue, we add j to X^t (this is valid thanks to submodularity, since the stored gains upper-bound the true gains); otherwise, we update ρ(j) to f(j|X^t) and re-sort the priority queue. The complexity of this algorithm is roughly O(k n_R T_f), where n_R is the average number of re-sorts in each iteration. Note that n_R ≤ n, while in practice it is a constant, thus offering almost a factor-n speedup over the simple greedy algorithm. One of the terms in this complexity is T_f, which involves evaluating f(X ∪ {j}) − f(X). One option is a naive implementation that computes f(X ∪ {j}) and f(X) separately and takes the difference. However, due to the greedy nature of the algorithm, we can use memoization and maintain precomputed statistics p_f(X) at a set X, using which the gain can be evaluated much more efficiently (Iyer and Bilmes, 2019). At every iteration, we evaluate f(j|X) using p_f(X), which we denote f(j|X, p_f).
We then update p_f(X ∪ {j}) after adding element j to X. Both the supervised and unsupervised facility location functions admit such precomputed statistics, thereby enabling further speedups.
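Lazy greedy with memoization can be sketched for facility location as follows (a minimal implementation of our own; it assumes nonnegative similarities, and the precompute statistic p[i] = max_{j∈X} S[i, j] lets each gain be evaluated in O(n)):

```python
import heapq
import numpy as np

def lazy_greedy_fl(S, k):
    """Lazy greedy maximisation of f(X) = sum_i max_{j in X} S[i, j].

    A max-heap stores stale upper bounds on the gains; an element is accepted
    only when its refreshed gain still beats the next-best stale bound.
    """
    n = S.shape[0]
    p = np.zeros(n)  # precompute statistic: best similarity of each point to X
    # Heap of (negated) initial gains f({j}) = sum_i S[i, j] (nonnegative S).
    heap = [(-S[:, j].sum(), j) for j in range(n)]
    heapq.heapify(heap)
    X = []
    while len(X) < k and heap:
        neg_rho, j = heapq.heappop(heap)
        gain = np.maximum(S[:, j] - p, 0.0).sum()  # fresh gain via the statistic
        if not heap or gain >= -heap[0][0]:
            X.append(j)                       # accept j
            p = np.maximum(p, S[:, j])        # update the precompute statistic
        else:
            heapq.heappush(heap, (-gain, j))  # re-insert with the updated bound
    return X
```

By submodularity, the stored bounds only shrink, so most heap entries are never refreshed; this is the source of the near factor-n speedup discussed above.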

D Role of different components in the loss function
Given that our loss function has seven components (including the quality guides), a natural question is 'how do we choose among the different components for joint learning (JL)?' Another question we attempt to answer is 'are all the components necessary for JL?' For our final model (i.e., the results presented in Tables 6 and 7 of the main paper), we attempt to choose the best-performing JL combination of the seven loss components, viz. L1, L2, L3, L4, L5, L6 and the quality guides (QG). To choose the 'best' JL combination, we evaluate the performance of the different JL combinations on the validation set. Since we generally observe considerably weaker performance when selecting fewer than three loss terms, we restrict our search to combinations of three or more loss terms. We report the performance on the test data of various JL combinations from our objective function for each of the four datasets. For each dataset, the numbers in bold refer to the 'best'-performing JL combination, determined based on performance on the validation set. Our observations are as follows. Firstly, all the loss components (barring L2 for three datasets) contribute to the best model. Furthermore, the best JL combination (picked on the basis of the validation set) either achieves the best performance or comes close to the best among the different JL combinations, as measured on the test dataset. Secondly, we observe that the QGs do not cause a significant improvement in performance during training.