KnowMAN: Weakly Supervised Multinomial Adversarial Networks

The absence of labeled data for training neural models is often addressed by leveraging knowledge about the specific task, resulting in heuristic but noisy labels. The knowledge is captured in labeling functions, which detect certain regularities or patterns in the training samples and annotate corresponding labels for training. This process of weakly supervised training may result in an over-reliance on the signals captured by the labeling functions and hinder models from exploiting other signals or from generalizing well. We propose KnowMAN, an adversarial scheme that makes it possible to control the influence of signals associated with specific labeling functions. KnowMAN forces the network to learn representations that are invariant to those signals and to pick up other signals that are more generally associated with an output label. KnowMAN strongly improves results compared to direct weakly supervised learning with a pre-trained transformer language model and a feature-based baseline.


Introduction
Neural approaches rely on labeled data sets for training. For many tasks and languages, such data is either scarce or not available at all. Knowledge-based weak supervision tackles this problem by employing labeling functions (LFs). LFs are manually specified properties, e.g. keywords, that trigger the automatic annotation of a specific label. However, these annotations contain noise and biases that need to be handled.
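Concretely, an LF can be as simple as a keyword test that either assigns a label or abstains. The following toy sketch illustrates the idea; all function names, keywords, and labels are made up for illustration and are not taken from the paper:

```python
# Minimal sketch of keyword-based labeling functions (LFs).
# Labels and keywords are illustrative, not those of any real dataset.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_check_out(text: str) -> int:
    """Fires SPAM if a prototypical spam phrase occurs, else abstains."""
    return SPAM if "check out" in text.lower() else ABSTAIN

def lf_short_comment(text: str) -> int:
    """Very short comments tend to be legitimate reactions."""
    return HAM if len(text.split()) < 5 else ABSTAIN

def apply_lfs(text, lfs):
    """Collect the votes of all LFs for one training sample."""
    return [lf(text) for lf in lfs]

votes = apply_lfs("Check out my channel!!!",
                  [lf_contains_check_out, lf_short_comment])
```

In practice, such votes are then aggregated (e.g. by majority vote or a denoising model) into a single weak label per sample.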
A recent approach for denoising weakly supervised data is Snorkel (Ratner et al., 2020). Snorkel focuses on estimating the reliability of LFs and of the resulting heuristic labels. However, Snorkel does not address biases on the input side of weakly supervised data, which might lead to learned representations that overfit the characteristics of specific LFs, hindering generalization. We address the problem of overfitting to the LFs in this paper.
Other approaches tackle such overfitting by deleting the LF signal completely from the input side of an annotated sample: For example, Go et al. (2009) strip out emoticons that were used for labeling the sentiment in tweets, and Alt et al. (2019) mask the entities used for distant supervision of relation extraction training data (Mintz et al., 2009). However, as LFs are often constructed from the most prototypical and reliable signals (e.g., keywords), deleting them entirely from the feature space might, while preventing over-reliance on them, hurt prediction quality considerably. Instead, we find a way to blur the signals of the LFs rather than removing them.
In this work we propose KnowMAN (Knowledge-based Weakly Supervised Multinomial Adversarial Networks), a method for controllable soft deletion of LF signals, allowing a trade-off between reliance and generalization. Inspired by adversarial learning for domain adaptation (Chen and Cardie, 2018a; Ganin and Lempitsky, 2015), we consider LFs as domains and aim to learn an LF-invariant feature extractor in our model. KnowMAN is composed of three modules: a feature extractor, a classifier, and a discriminator. Specifically, KnowMAN employs a classifier that learns the actual task and an adversarial opponent, the LF discriminator, that learns to distinguish between the different LFs. Upstream of both is the shared feature extractor, to which the gradient of the classifier and the reversed gradient of the discriminator are propagated. In our experiments, the feature extractor for encoding the input is a multi-layer perceptron on top of either a bag-of-words vector or a transformer architecture, but KnowMAN is in principle usable with any differentiable feature extractor.
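The interplay of the three modules can be pictured as a simple composition in which the shared representation feeds both heads. This is only a structural sketch with toy stand-in functions; all names and operations are illustrative:

```python
# Structural sketch of KnowMAN's three modules (toy stand-ins, not the
# actual MLP/transformer implementations used in the paper).
def feature_extractor(x):
    """F_s: maps the raw input vector to a shared representation."""
    return [v * 0.5 for v in x]

def classifier(h):
    """C: predicts the task label from the shared representation."""
    return "positive" if sum(h) > 0 else "negative"

def lf_discriminator(h):
    """D: guesses which LF annotated the sample (here: a toy argmax)."""
    return max(range(len(h)), key=lambda i: h[i])

h = feature_extractor([0.2, -0.1, 0.8])   # shared features
label = classifier(h)                     # task head
lf_id = lf_discriminator(h)               # adversarial head
```

Both heads read the same representation `h`; training then rewards C's gradient and reverses D's gradient before they reach `feature_extractor`.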
KnowMAN consistently outperforms our baselines by 2 to 30%, depending on the dataset. By setting a hyperparameter λ that controls the influence of the adversarial part, we can control the degree to which the information of LF-specific signals is discarded. The optimal λ value depends on the dataset and its properties.

Figure 1: KnowMAN architecture. The figure depicts one iteration over a batch of inputs. The parameters of C and F_s are updated together, following the green arrows. The LF discriminator D is updated following the red arrows. Solid lines indicate the forward pass, dashed lines the backward pass.
The contributions of this work are i) an adversarial architecture for controlling the influence of signals associated with specific LFs, ii) consistent improvements over weakly supervised baselines, and iii) the release of our code at https://github.com/LuisaMaerz/KnowMAN. To our knowledge, we are the first to apply adversarial learning to overcome the noisiness of labels in weak supervision.

Method
Our approach is composed of three interacting modules: i) the shared feature extractor F_s, ii) the classifier C, and iii) the LF discriminator D. The loss function of C rewards the classifier for predicting the correct label for an instance, and its gradient is used to optimize the shared feature extractor and classifier modules towards that goal. At the same time, the loss function of the LF discriminator D rewards predicting which LF was responsible for labeling an instance. In adversarial optimization, however, KnowMAN backpropagates the reversed gradient of the LF discriminator, so information indicative for distinguishing between specific LFs is weakened throughout the network. The hyperparameter λ controls the degree to which these signals are weakened: the higher the value, the more influence is assigned to the reversed discriminator gradient. The result of the interplay between classifier and LF discriminator is a shared feature representation that is good at predicting the labels while reducing the influence of LF-specific signals, encouraging the shared feature extractor to take other information (correlated with all LFs for a class) into account.
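The effect of the reversed gradient on a shared parameter can be sketched without any deep-learning framework: the feature extractor descends the classifier gradient while ascending the discriminator gradient, scaled by λ. All numbers below are illustrative:

```python
# Toy sketch of the gradient flow into one shared parameter of F_s.
# grad_C pushes toward better label prediction; grad_D would push toward
# better LF discrimination, so its reversal (-lambda * grad_D) blurs
# LF-specific signals instead. Values are made up for illustration.
def shared_update(theta, grad_C, grad_D, lam, lr=0.1):
    # Effective gradient seen by F_s: gradient of the classification loss
    # minus lambda times the discriminator gradient (gradient reversal).
    effective_grad = grad_C - lam * grad_D
    return theta - lr * effective_grad

# With lambda = 0, training reduces to plain weakly supervised learning:
no_adv = shared_update(1.0, grad_C=0.5, grad_D=0.3, lam=0.0)
# With lambda > 0, the LF-discrimination signal is actively unlearned:
theta_new = shared_update(1.0, grad_C=0.5, grad_D=0.3, lam=2.0)
```

Note how the reversed term can even flip the update direction when λ is large, which is exactly the "blurring" of LF signals described above.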
In Figure 1, the arrows illustrate the training flow of the three modules. Due to the adversarial nature of the LF discriminator D, it has to be trained with a separate optimizer (red arrows), while the rest of the network is updated with the main optimizer (green arrows). When D is trained, the parameters of C and F_s are frozen, and vice versa.
To calculate the losses we use the canonical negative log-likelihood (NLL) loss for both the classifier and the LF discriminator. The classification NLL can be formalized as

$J_C = -\sum_{i=1}^{N} \log P_C(\hat{y}_i = y_i)$

where $y_i$ is the (weakly supervised) annotated label and $\hat{y}_i$ is the prediction of the classifier module C for a training sample $i$. Analogously, we can define the NLL for the LF discriminator as

$J_D = -\sum_{i=1}^{N} \log P_D(\hat{lf}_i = lf_i)$

where $lf_i$ is the actual LF used for annotating sample $i$ and $\hat{lf}_i$ is the LF predicted by the discriminator D. Accordingly, we minimize two different objectives within KnowMAN: D minimizes $J_D$, while C and the shared feature extractor minimize the classification objective. The shared feature extractor has two different goals: i) help C achieve better classification performance and ii) make the feature distribution invariant to the signals from the LFs. This is captured by the shared objective

$J_{F_s} = J_C - \lambda J_D$

where λ is the parameter that controls the adversarial influence, i.e. the degree of LF signal blur, and $-J_D$ is the reversed loss of the LF discriminator D, which represents C's adversarial opponent. In general, the exact implementation or architecture of the individual modules is interchangeable and can be set up as required. This makes KnowMAN a universally applicable and easily customizable architecture.
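As a concrete numeric check of these objectives, the per-sample losses can be computed directly from the predicted distributions. The probabilities below are made up for illustration:

```python
import math

def nll(probs, target):
    """Negative log-likelihood of one sample: -log p(target)."""
    return -math.log(probs[target])

# One training sample: classifier distribution over 2 task labels,
# discriminator distribution over 3 LFs (all values illustrative).
y_probs = [0.1, 0.9]
lf_probs = [0.7, 0.2, 0.1]

J_C = nll(y_probs, target=1)    # weak label of this sample is class 1
J_D = nll(lf_probs, target=0)   # the sample was annotated by LF 0
lam = 2.0
J_Fs = J_C - lam * J_D          # shared objective with reversed D loss
```

A confident discriminator (large probability on the true LF, small $J_D$) contributes little to the reversal; a shared representation that makes the LF hard to guess drives $J_D$ up and $J_{F_s}$ down, which is what the adversary rewards.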

Data
For our experiments we use three standard datasets for weak supervision.
Spam. Based on the YouTube comments dataset (Alberto et al., 2015), there is a smaller Spam dataset from Snorkel (Ratner et al., 2020), where the task is to classify whether a text is relevant to a certain YouTube video or contains spam. This dataset is very small and consists of a train and a test set only. The 10 LFs use keywords and regular expressions.
Spouse. This dataset for extracting the spouse relation was also created by the Snorkel authors and is based on the Signal Media One-Million News Articles Dataset (Corney et al., 2016). The 9 LFs use information from a knowledge base, keywords, and patterns. One peculiarity of this dataset is that over 90% of the instances do not express a spouse relation.
IMDb. The IMDb dataset contains movie reviews to be classified by sentiment (binary: positive or negative). The LFs used for this dataset are occurrences of positive and negative keywords from Hu and Liu (2004). A particular characteristic of this dataset is its large number of LFs, 6,800, which poses a particular challenge to the Snorkel denoising framework: Snorkel fails to compute its generative model because its memory consumption exceeds the available limit of 32GB RAM.

Experimental setup
For the experiments we use two different methods for encoding the input: i) TF-IDF encoding and ii) a DistilBERT transformer. For TF-IDF encoding, we vectorize the input sentences and feed them to a simple MLP. In the transformer setting, the sequences of words are encoded using a pretrained DistilBERT. Like BERT (Devlin et al., 2019), DistilBERT is a masked transformer language model; it is a smaller, lighter, and faster version obtained via knowledge distillation that retains 97% of BERT's language understanding capabilities (Sanh et al., 2019).
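To make the TF-IDF encoding concrete, here is a minimal pure-Python version on a toy corpus. It uses raw term frequency and an unsmoothed log IDF with no normalization, so values will differ from standard library implementations (the paper uses a standard vectorizer):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF sketch: tf = raw count, idf = log(N / df),
    no smoothing or normalization (illustrative only)."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        vecs.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vecs

vocab, vecs = tfidf_vectors(["great movie", "bad movie"])
```

Words occurring in every document (here "movie") get an IDF of zero, so only the discriminative terms carry weight; the resulting vectors are what the MLP feature extractor consumes.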
Our encoder takes the representation of the CLS token from a frozen DistilBERT and learns a nonlinear transformation with a drop-out layer to avoid overfitting (Srivastava et al., 2014): $F_s(x_i) = \mathrm{Dropout}\big(f\big(\mathrm{DistilBERT}(x_i)_{[CLS]}\big)\big)$, where $\mathrm{DistilBERT}(\cdot)_{[CLS]}$ generates the hidden state of BERT's classifier token (CLS) and the function $f$ represents a linear transformation, for the $i$-th sentence $x_i$.
The classifier and discriminator networks following the feature extractor are in line with the implementation of Chen and Cardie (2018a) for domain-adversarial learning. Both are simple sequential models with dropout, batch normalization, ReLU activations, and a softmax as the last layer. Please see our code for implementation details. In the TF-IDF setup we use Adam (Kingma and Ba, 2014) for both optimizers. When using transformer encoding, the D optimizer is again Adam, while the C optimizer is AdamW (Loshchilov and Hutter, 2018), as this yielded more stable results.
Baselines For each input encoding we implemented several baselines. Weakly supervised TF-IDF (WS TF-IDF) and weakly supervised DistilBERT (WS DistilBERT) both derive the label for each instance in the train set from its matching LFs. WS TF-IDF directly applies a logistic regression classifier to the input and the derived labels. WS DistilBERT directly uses the uncased English DistilBERT model (Sanh et al., 2019) as a prediction model. The second pair of baselines (Feature TF-IDF, Feature DistilBERT) uses the feature extractor and classifier layers of KnowMAN without taking the information of D into account (this is equivalent to setting λ to zero). We also fine-tuned the pure language model (Fine-tuned DistilBERT) without further transformations and without the KnowMAN architecture.
We also compare with training TF-IDF and DistilBERT models on labels denoised by Snorkel (Snorkel TF-IDF, Snorkel DistilBERT). However, Snorkel denoising failed for the IMDb dataset due to its large number of LFs.
KnowMAN We refer to the full architecture as KnowMAN TF-IDF and KnowMAN DistilBERT. Depending on the dataset we choose different λ values. We also implemented two ways of evaluating and saving the best model during training: i) evaluate after each batch and save the best model, ii) evaluate after a certain number of steps in between batches and save the best model.
Hyperparameters We perform hyperparameter tuning using Bayesian optimization (Snoek et al., 2012) for the IMDb and Spouse datasets. For Spam, hyperparameters are not optimized, as no validation set is available. The sampling history, the resulting hyperparameters, and the hyperparameters chosen for the Spam dataset are reported in the Appendix (Figures 2 and 3).
Evaluation For the IMDb and Spam datasets we report accuracy; for the Spouse dataset we report the F1 score of the positive class. To check statistical significance we use randomized testing (Yeh, 2000). Results are considered significant if p < 0.05.
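The randomized test can be sketched in a few lines, in the spirit of the approximate randomization test of Yeh (2000). This toy version compares raw per-sample correctness sums, which is a simplification of testing a full metric difference:

```python
import random

def randomized_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test (illustrative sketch).
    Randomly swaps the two systems' per-sample outcomes and counts how
    often the shuffled difference is at least as large as the observed
    one; returns a smoothed p-value."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        a, b = 0.0, 0.0
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this sample's outcomes
                x, y = y, x
            a += x
            b += y
        if abs(a - b) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Per-sample correctness (1 = correct) of two hypothetical systems:
p = randomized_test([1, 1, 1, 0, 1, 1, 1, 1],
                    [0, 0, 1, 0, 1, 0, 0, 1])
```

The intuition: if the two systems were interchangeable, swapping their per-sample outcomes should often produce a difference as large as the observed one; a small p says the observed gap is unlikely under that null hypothesis.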

Results
The results of the experiments are shown in Table 1. In the TF-IDF setup, KnowMAN TF-IDF outperforms the baselines across all datasets. We find the optimal λ values to be 2 for Spam, 5 for Spouse, and 4.9 for IMDb. Using the additional feature extractor layer (Feature TF-IDF) is beneficial compared to direct logistic regression for all datasets. Snorkel TF-IDF outperforms the other two baselines for the Spouse dataset only.
Fine-tuning DistilBERT cannot outperform our best KnowMAN model. However, for the Spam dataset, Fine-tuned DistilBERT gives better results than KnowMAN DistilBERT but is still worse than KnowMAN TF-IDF. Compared to WS TF-IDF, WS DistilBERT gives the same results for the Spam dataset and slightly better results for IMDb; for Spouse, the performance decreases. Snorkel DistilBERT outperforms the other two baselines for the Spam dataset only. The low performance of Snorkel on IMDb (for both DistilBERT and TF-IDF) might be explained by the very large number of LFs for this dataset. The KnowMAN DistilBERT results across datasets are in line with the TF-IDF setup: KnowMAN outperforms all baselines for the Spouse and IMDb datasets. We observe that λ = 5 for Spouse and λ = 1 for IMDb is most beneficial when using DistilBERT. For the Spam dataset, KnowMAN (with λ = 2) outperforms all baselines except the fine-tuned DistilBERT model.

Discussion
The performance drop we observe with KnowMAN DistilBERT compared to the TF-IDF setup on the IMDb dataset could be explained by implementation details. Due to memory issues, we have to truncate the input when using DistilBERT; since the movie reviews from IMDb are rather long, this could harm performance. Since the Spam dataset is very small, a single wrongly classified instance can have a large impact on the results, which could explain why KnowMAN TF-IDF outperforms KnowMAN DistilBERT here as well. In general, we could not perform hyperparameter optimization for the DistilBERT experiments due to memory issues; the results of those experiments might therefore not have reached their optimum. Even so, they show the value of using KnowMAN.

Overall, our results confirm the assumption that KnowMAN enables a focus shift of the shared feature extractor from the signals of the LFs towards other valuable signals. KnowMAN consistently and significantly improves over the other experiments, except on the Spam dataset, where we assume the dataset is too small to see significant changes in the results.

Compared to the implementation of Chen and Cardie (2018a), we could not use a specialized domain feature extractor in our experiments, because our test sets do not contain information about LF matches. We plan to address this issue by integrating a mixture-of-experts module for the specialized feature extractor, as recommended by Chen et al. (2019).

Related Work
Adversarial neural networks have been used to reduce the divergence between distributions, e.g. by Goodfellow et al. (2014), Chen et al. (2018), and Ganin and Lempitsky (2015). The latter proposed an architecture with gradient reversal and a shared feature extractor; unlike us, they focused on a binary domain discriminator. Similarly, Chen and Cardie (2018a) use an adversarial approach in a multinomial scenario for domain adaptation.
Some works on adversarial learning in the context of weak supervision focus on different aspects and only share similarity in name with our approach: Wu et al. (2017) use virtual adversarial training (Miyato et al., 2017) for perturbing input representations, which can be viewed as a general regularization technique not specific to weakly supervised learning. Qin et al. (2018) and Zeng et al. (2018) use generative adversarial mechanisms for selecting negative training instances that are difficult for a classifier to discriminate from heuristically annotated ones.
Several approaches have focused on denoising the labels for weakly supervised learning (Takamatsu et al., 2012;Manning et al., 2014;Lin et al., 2016). Snorkel (Ratner et al., 2020) is one of the most general approaches in this line of work. However, Snorkel only models biases and correlations of LFs, and does not consider problems of weak supervision that may stem from biases in the features and learned representations.
A recent approach that focuses on denoising weakly supervised data is Knodle (Sedova et al., 2021), a framework for comparing different methods that improve weakly supervised learning. We use some of their datasets but, in contrast, denoise the signals of the LFs during training.

Conclusion
We propose KnowMAN, an adversarial neural network for training models with noisy weakly supervised data. By integrating a shared feature extractor that learns labeling-function-invariant features, KnowMAN drastically improves results on weakly supervised data across all experiments and datasets in our setup. The experiments also show that the adverse effect of labeling-function-specific signals is highly dependent on the datasets and their properties. Therefore, it is crucial to tune the λ parameter on a validation set to find the optimal degree of blurring the labeling function signals. Since the modules in the KnowMAN architecture are easily exchangeable, KnowMAN can be applied to any architecture and dataset labeled with heuristic labeling functions.

A Appendix

A.2 Hyperparameter optimization
We perform hyperparameter tuning using Bayesian optimization (Snoek et al., 2012). Bayesian optimization uses Bayes' theorem to direct the search for the minimum or maximum of a black-box objective function. Compared with random search and grid search, it tends to find better hyperparameters in fewer steps by properly balancing exploration and exploitation. Our hyperparameter space includes the batch size, dropout, the number of iterations over D, the shared hidden size of the models, the learning rates for D, F_s, and C, and the numbers of layers of C, D, and F_s. We implemented two ways of evaluating and saving the best model during training: i) evaluate after each batch and save the best model, ii) evaluate after a certain number of steps in between batches and save the best model. We also optimized the number of steps when logging in between batches. We evaluated the models for IMDb and Spouse on the respective validation set. For the Spam dataset, no development set is available, so we used the following hyperparameters for KnowMAN TF-IDF, following the parameters used in Chen and Cardie (2018b): batch size: 32, dropout: 0.4, n critic: 5, lambda: 2.0, shared hidden size: 700, learning rate C & F: 0.0001, learning rate D: 0.0001, number of F layers: 1, number of C layers: 1, number of D layers: 1.

A.3 Experimental details
We ran our experiments on a DGX-1 server with one V100 GPU per experiment. The runtime of one model depends on the dataset: 0.25 hours for the Spam dataset, 0.25 hours for the Spouse dataset, and 8 hours for the IMDb dataset. Our implementation is available at https://github.com/LuisaMaerz/KnowMAN.